Compiler optimization of dataflow applications using mixed integer equations

ABSTRACT

A method comprises a compiler generating a MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware of a computing system to execute the application. The MI model comprises MI equations to solve by an MI solver. The MI equations include equations of an objective function corresponding to an optimization objective. The MI equations can comprise decision variables and equations and constraint variables and equations. The compiler outputs the MI model to the MI solver and invokes the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model. The compiler receives the MI solution and generates a globally optimized mapping decision based on the MI solution. The MI solver can comprise a commercial program to solve MI linear equations. A computer program product and a computing system can implement the method.

CROSS-REFERENCE AND INCORPORATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 63/327,313 filed Apr. 4, 2022, which is incorporated by reference herein in its entirety.

This application further claims the benefit of U.S. Provisional Patent Application No. 63/330,730 filed Apr. 13, 2022, which is incorporated by reference herein in its entirety.

This application further claims the benefit of U.S. Provisional Patent Application No. 63/330,740 filed Apr. 13, 2022, which is incorporated by reference herein in its entirety.

This application further claims the benefit of U.S. Provisional Patent Application No. 63/326,206 filed Mar. 31, 2022, which is incorporated by reference herein in its entirety.

This application further claims the benefit of U.S. Provisional Patent Application No. 63/326,762 filed Apr. 1, 2022, which is incorporated by reference herein in its entirety.

The following are incorporated by reference for all purposes as if fully set forth herein:

Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA ‘17, Jun. 24-28, 2017, Toronto, ON, Canada;

Koeplinger et al., “Spatial: A Language and Compiler for Application Accelerators,” Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Proceedings of the 43rd International Symposium on Computer Architecture, 2018.

U.S. Nonprovisional patent application Ser. No. 16/239,252, filed Jan. 3, 2019, titled “VIRTUALIZATION OF A RECONFIGURABLE DATA PROCESSOR,” (Attorney Docket No. SBNV 1000-1);

U.S. Nonprovisional patent application Ser. No. 16/536,192, filed Aug. 8, 2019, entitled “COMPILER FLOW LOGIC FOR RECONFIGURABLE ARCHITECTURES,” (Attorney Docket No. SBNV 1006-1);

U.S. Nonprovisional patent application Ser. No. 16/572,527, filed Sep. 16, 2019, entitled “PERFORMANCE ESTIMATION-BASED RESOURCE ALLOCATION FOR RECONFIGURABLE ARCHITECTURES,” (Attorney Docket No. SBNV 1016-2);

U.S. patent application Ser. No. 16/922,975, filed Jul. 7, 2020, titled “RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATA FLOW RESOURCES,” (Attorney Docket No. SBNV 1026-1);

U.S. Nonprovisional patent application Ser. No. 17/216,651, filed Mar. 29, 2021, entitled “LOSSLESS TILING IN CONVOLUTION NETWORKS—TILING CONFIGURATION,” (Attorney Docket No. SBNV 1034-2);

U.S. Nonprovisional patent application Ser. No. 17/216,652, filed Mar. 29, 2021, entitled “LOSSLESS TILING IN CONVOLUTION NETWORKS—SECTION BOUNDARIES,” (Attorney Docket No. SBNV 1034-3);

U.S. Nonprovisional patent application Ser. No. 17/384,507, filed Jul. 23, 2021, entitled “LOSSLESS TILING IN CONVOLUTION NETWORKS— BACKWARD PASS,” (Attorney Docket No. SBNV 1034-9); and,

US Nonprovisional Patent Application titled “SEARCHING CONVOLUTIONAL NETWORK NODES BASED ON NAMED MATRIX DIMENSIONS,” Attorney Docket No. SBNV1109USN01, by Yang, et al.

FIELD OF THE TECHNOLOGY

The technology disclosed relates to compilers for data parallel and dataflow applications, such as convolutional neural networks, machine learning applications, and artificial intelligence computing systems. In particular, the technology disclosed relates to compilers for computing systems using reconfigurable processors, such as coarse-grain reconfigurable processors to execute convolutional neural networks and other dataflow computing applications.

BACKGROUND

The present disclosure relates to compilers for data parallel and dataflow applications, such as convolutional neural network and machine learning applications. In particular the present disclosure relates to determining allocation of computing system hardware resources to execute such applications.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings included in the present disclosure are incorporated into, and form part of, the specification. They illustrate implementations of the present disclosure (hereinafter, “the disclosure”) and, along with the description, serve to explain the principles of the disclosure. The drawings are intended to be only illustrative of certain implementations and are not intended to limit the disclosure.

FIG. 1 illustrates an example coarse-grain reconfigurable (CGR) system (CGRS), according to aspects of the disclosure.

FIG. 2 illustrates an example sub-graph, according to aspects of the disclosure.

FIG. 3 illustrates an example compiler stack, according to aspects of the disclosure.

FIG. 4A illustrates an example mapping decision space, according to aspects of the disclosure.

FIG. 4B illustrates an example structure of a model analyzer and compiler, according to aspects of the disclosure.

FIG. 5 illustrates an example method for performing multiple decision passes by a CGRS compiler, according to aspects of the disclosure.

FIG. 6 illustrates an example method of a compiler to generate a CGRA resource mapping using a Mixed Integer (MI) mapping model, according to aspects of the disclosure.

FIG. 7 illustrates an example system for performing methods of the disclosure, and/or operators thereof, according to aspects of the disclosure.

FIG. 8A illustrates an example method to tile operands and partition operators of an application model, according to aspects of the disclosure.

FIG. 8B illustrates an example method to determine CGR hardware mapping proposals, according to aspects of the disclosure.

FIG. 9A illustrates an example application graph, according to aspects of the disclosure.

FIG. 9B illustrates an example of convolutional partitioning and de-partitioning, according to aspects of the disclosure.

FIG. 9C illustrates an example of shape fulfillment, according to aspects of the disclosure.

FIG. 9D illustrates an example intermediate representation of the application graph of FIG. 9A with convolutional partitioning and de-partitioning and operators organized in sections, according to aspects of the disclosure.

FIG. 10 illustrates an example system for performing heuristic resource allocation, according to aspects of the disclosure.

In the figures, like reference numbers can indicate functionally similar elements. The systems and methods illustrated in the figures, and described in the Detailed Description below, can be arranged and designed in a wide variety of different implementations. Neither the figures nor the Detailed Description are intended to limit the scope of the claims. Instead, they merely represent examples of different implementations of the disclosed technology.

SUMMARY

A method comprises a compiler generating a MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware of a computing system to execute the application. The MI model comprises MI equations to solve by an MI solver. The MI equations include equations of an objective function corresponding to an optimization objective. The MI equations can comprise decision variables and equations and constraint variables and equations.

The compiler outputs the MI model to the MI solver and invokes the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model. The compiler receives the MI solution and generates a globally optimized mapping decision based on the MI solution. The MI solver can comprise a commercial program to solve MI linear equations.

A computer program product and a computing system can implement the method. The computing system can comprise a graph corresponding to a dataflow application, a hardware specification describing hardware of a second computing system for executing the dataflow application, a first processor and a second processor, an MI (Mixed Integer) Solver, and a compiler. The compiler can execute on the first processor to generate the MI model; output the MI model to the MI solver; invoke the MI solver to compute the MI solution; receive the MI solution; and, generate a globally optimized mapping decision based on the MI solution. The MI Solver can execute on the second processor to: access the MI model; solve equations among the MI equations; and, output the MI solution.
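As a minimal sketch of the kind of model summarized above, consider a toy mapping problem that is not taken from the disclosure: binary decision variables assign each operator to exactly one hardware unit, constraint equations bound the load on each unit, and the objective function minimizes total cost. Exhaustive enumeration stands in here for the MI solver, and the names `solve_mapping`, `cost`, `demand`, and `capacity` are hypothetical.

```python
from itertools import product


def solve_mapping(cost, demand, capacity):
    """Toy stand-in for an MI solver: assign each operator to one unit.

    cost[i][j]  -- hypothetical cost of running operator i on unit j
    demand[i]   -- hypothetical resource demand of operator i
    capacity[j] -- hypothetical resource capacity of unit j
    Returns (best_cost, assignment); assignment[i] is the unit for operator i,
    or (inf, None) when no assignment satisfies the constraints.
    """
    n_ops, n_units = len(cost), len(capacity)
    best = (float("inf"), None)
    # Each tuple encodes binary decision variables x[i][j] = 1 iff
    # assignment[i] == j, so "one unit per operator" holds by construction.
    for assignment in product(range(n_units), repeat=n_ops):
        load = [0] * n_units
        for op, unit in enumerate(assignment):
            load[unit] += demand[op]
        # Constraint equations: no unit may exceed its capacity.
        if any(load[j] > capacity[j] for j in range(n_units)):
            continue
        # Objective function: total cost of the chosen assignment.
        total = sum(cost[op][unit] for op, unit in enumerate(assignment))
        if total < best[0]:
            best = (total, assignment)
    return best


cost = [[4, 6], [3, 6], [7, 2]]   # three operators, two units
demand = [2, 2, 1]
capacity = [3, 3]
best_cost, assignment = solve_mapping(cost, demand, capacity)
print(best_cost, assignment)
```

Enumeration is only practical for very small problems; a production MI solver would typically explore the same decision space with techniques such as branch and bound.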

DETAILED DESCRIPTION

Aspects of the present disclosure (hereinafter, “the disclosure”) relate to methods of compiling neural network applications for execution on computing systems utilizing reconfigurable dataflow processing elements, in particular utilizing coarse-grain reconfigurable processors (CGRPs). More particular aspects relate to determining mappings of neural network operators and data flow to CGRP processing and/or memory elements, and/or configurations of CGRP processing and/or memory elements. Implementations of the disclosure (hereinafter, “implementations”) can analyze a computation graph of a machine learning model to determine alternative mappings.

Processing elements that implement aspects of the disclosure can include processors of data parallel (DP) and/or dataflow computing systems, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Digital Signal Processors (DSPs). Certain aspects of the disclosure relate to executing neural networks on computing systems utilizing reconfigurable processor architectures, such as CGRPs, reconfigurable Application Specific Integrated Circuits (ASICs), and/or Application Specific Instruction-set Processors (ASIPs).

Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. The disclosure in some instances repeats references to these options. However, omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.

Particular expressions of the disclosure will be understood to have the following operative meanings:

-   The phrases “at least one”; “one or more”; and “and/or” are to be understood as open-ended expressions that operate both conjunctively and disjunctively. For example, each of the expressions “at least one of A, B, and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C”, and “one or more of A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.
-   The term “a” or “an” entity refers to one or more of that entity. As such, the terms “a”/“an”, “one or more”, and “at least one” can be used interchangeably herein.
-   The terms “comprising”, “including”, and “having” can be used interchangeably herein.

Unless otherwise specified, the use of ordinal adjectives first, second, third, etc., to describe an object, merely refers to different instances or classes of the object and does not imply any ranking or sequence.

As used herein, “incorporated subject matter” refers, collectively, to subject matter disclosed, and/or otherwise encompassed, among the disclosures incorporated herein by reference. For purposes of illustrating the disclosure, but not intended to limit implementations, various terms of the disclosure are drawn from the incorporated subject matter. As used herein, unless expressly stated otherwise, such terms as can be found in the incorporated subject matter have the same meanings, herein, as their meanings in their respective incorporated disclosures.

Aspects of the disclosure can be appreciated through a discussion of example implementations and/or applications of methods and/or systems. However, such examples are for purposes of illustrating the disclosure. It should be understood that the intention is not to limit the disclosure to the example implementations described herein, but to encompass all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. Thus, the disclosure is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein. Various modifications to the disclosed examples will be readily appreciated by those of ordinary skill in the art, and the general principles defined herein can be applied to other implementations of the disclosure without departing from the spirit and scope of the disclosure.

The disclosure uses terms and acronyms related to the field of the technology, defined, at least in part, herein as:

AI— artificial intelligence.

AIR— arithmetic or algebraic intermediate representation.

ALN— array-level network.

Application Model—In machine learning applications, “application model” commonly refers to a mathematical representation of a machine learning application. An application model can comprise an application graph and/or textual (e.g., high level, intermediate level, and/or low level programming language) representation. An application model can represent a set of mathematical operators (compute functions of an application) and a flow of data between the operators, and can represent the operators and dataflow graphically and/or textually. As used herein, “application model” or, simply, “model” refers interchangeably to an application itself (e.g., high level programming statements of an application) and a graphical and/or textual representation of the application's compute functions and/or dataflow.

Buffer—an intermediate storage of data.

CGR— coarse-grained reconfigurable. A property of, for example, a system, a processor, an architecture (see CGRA), an array, or a unit in an array. This property distinguishes the system, etc., from field-programmable gate arrays (FPGAs), which can implement digital circuits at the gate level and are therefore fine-grained configurable.

CGRA— coarse-grained reconfigurable architecture. A data processor architecture that includes one or more arrays (CGR arrays) of CGR units.

CGR unit—a circuit that can be configured and reconfigured to locally store data (e.g., a memory unit or a partition memory unit, such as described in Prabhakar), or to execute a programmable function (e.g., a processor or other compute unit, or a partition compute unit such as described in Prabhakar). A CGR unit includes hardwired functionality that performs a limited number of functions used in computation graphs and dataflow graphs. Some implementations include switches to route data among CGR units.

CGR Array—an array of CGR units, coupled with each other through an array-level network (ALN), and coupled with external elements via a top-level network (TLN). In implementations a CGR array can physically implement the nodes and edges of a computation and/or dataflow graph.

CGRP— Coarse-grain reconfigurable processor. As used herein, CGRP refers to a processor, or processing element, based on a CGRA— such as an integrated circuit, chip, or module based on, or incorporating, a CGRA— and/or incorporates a CGR unit, CGR array, or elements of a CGR unit and/or a CGR array.

CGR Components—As used herein, “CGR components” refers, collectively, to hardware resources or elements of CGR units, CGR arrays, and CGRPs; memories of CGR units/arrays/processors; and networks and/or I/O interconnections and interface hardware interconnecting CGR units/arrays/processors and/or memories, such as Ethernet networks/interfaces, I/O buses/interfaces (e.g., PCI-Express buses), InfiniBand buses/interfaces, and/or memory or data buses/interfaces (e.g., buses of a processor and/or memory fabric), and related interface hardware.

CGR hardware—As used herein, the terms “CGR hardware” and “CGR hardware resources” refer to any individual hardware element, or combination of hardware elements, of CGR components of a CGRS.

CGRS— a computing system comprising CGR units and/or CGRPs. As used herein, CGRS refers to a computing system that is based on, and/or can utilize, reconfigurable computing resources, such as CGR arrays, CGR units, and/or CGRPs, to perform operations of data parallel and/or dataflow applications. U.S. Nonprovisional patent application Ser. No. 16/239,252, “VIRTUALIZATION OF A RECONFIGURABLE DATA PROCESSOR”, to Grohoski, et al, (hereinafter, “Grohoski”), and U.S. Nonprovisional patent application Ser. No. 16/922,975, “RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATA FLOW RESOURCES”, to Kumar, et al, (hereinafter, “Kumar”), both incorporated herein by reference, illustrate example implementations of CGR arrays, CGR units, CGRPs, and CGR systems.

Chip—As used herein, the term “chip” refers to an IC (or, combination of ICs) that can embody elements of a CGRA. A chip can typically be packaged in a chip module (e.g., a single chip module, “SCM” or, alternatively, a multi-chip module, “MCM”).

Compiler—a translator that translates statements written in a programming language into machine language instructions for a computer processor. A compiler can include multiple stages to operate in multiple steps. Each stage can create or update an intermediate representation (IR) of the translated statements. Compiler stages are illustrated with reference to FIG. 3.

Computation graph/Graph—As used herein, computation graph refers to a type of directed graph comprising nodes and edges connecting the nodes, to represent a dataflow application. In a neural network application, nodes can represent mathematical operations/expressions and edges can indicate dependencies between the operations/expressions. For example, in machine learning (ML) algorithms, input layer nodes can assign variables, output layer nodes can represent algorithm outcomes, and hidden layer nodes can perform operations on the variables. Edges can represent data (e.g., scalars, vectors, tensors) flowing between operations. In addition to dependencies, the computation graph reveals which operations and/or expressions can be executed concurrently.
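To illustrate how a computation graph can reveal concurrency, the following sketch groups the nodes of a small directed acyclic graph into levels; nodes in the same level have no dependencies on one another and could, in principle, execute concurrently. The graph, the node names, and the helper `concurrency_levels` are hypothetical and are not drawn from the disclosure.

```python
def concurrency_levels(edges, nodes):
    """Group nodes of a DAG into levels of mutually independent operations."""
    preds = {n: set() for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)  # dst depends on src
    done, levels = set(), []
    while len(done) < len(nodes):
        # A node is ready once all of its predecessors have completed.
        ready = sorted(n for n in nodes if n not in done and preds[n] <= done)
        if not ready:
            raise ValueError("graph contains a cycle")
        levels.append(ready)
        done.update(ready)
    return levels


# Hypothetical graph: two matrix multiplies feed an add, followed by a ReLU.
nodes = ["input", "matmul_a", "matmul_b", "add", "relu"]
edges = [
    ("input", "matmul_a"),
    ("input", "matmul_b"),
    ("matmul_a", "add"),
    ("matmul_b", "add"),
    ("add", "relu"),
]
levels = concurrency_levels(edges, nodes)
print(levels)
```

In this example the two matrix multiplies land in the same level, since neither depends on the other.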

Dataflow Application—As used herein, the term “dataflow” application refers interchangeably to data parallel and dataflow applications, such as ML, AI, and other massively parallel computing applications.

Dataflow/Application Graph—a computation graph, or portion of a computation graph, corresponding to operators (application compute functions), data, and flow of data among operators, of a dataflow application that includes one or more loops of operator nodes that can be nested, and wherein nodes can send messages to nodes in earlier (predecessor) layers to control the dataflow between the layers.

IC— integrated circuit—a monolithically integrated circuit, i.e., a single semiconductor die which can be delivered as a bare die or as a packaged circuit. For the purposes of this document, the term integrated circuit also includes packaged circuits that include multiple semiconductor dies, stacked dies, or multiple-die substrates. Such constructions are now common in the industry, produced by the same supply chains, and for the average user often indistinguishable from monolithic circuits.

Intermediate Representation (IR)— an Intermediate Representation is a representation of an application in an intermediate language. An IR can incorporate partial compilation results, such as sections (groupings) of a graph or model, pipelines that can be formed within a graph or model, and mappings of application functions or graph nodes/edges to hardware resources of a CGRS.

Logical CGR—A logical CGR array or logical CGR unit comprises a representation of a CGR array or a CGR unit that is physically realizable, but that may not, at a particular time in executing a dataflow application, have been assigned to a physical CGR array or to a physical CGR unit on an IC.

ML— machine learning.

PEF— processor-executable format—a file format suitable for configuring a configurable data processor.

Pipeline—a staggered flow of computational operations through a chain of pipeline stages in which the operations can be executed in parallel. In an application graph, a pipeline can comprise a set of operator nodes that can pipeline operations of the graph.

Pipeline Stages—a pipeline can be divided into stages that are coupled with one another as predecessor/successor stage to form a pipe topology.

PNR— place and route—the assignment of logical CGR units and associated processing/operations to physical CGR units in an array, and the configuration of communication paths between the physical CGR units.

RAIL— reconfigurable unit abstract intermediate language.

RP— reconfigurable processor. An RP can comprise, for example, field programmable gate arrays (FPGAs), graphic processing units (GPUs), and/or CGRPs.

TLIR— template library intermediate representation (IR).

TLN— top-level network.

Turning now to more particular aspects of the disclosure, high-level programs for machine learning (ML) and artificial intelligence (AI) can require massively parallel computations, where many parallel and interdependent computation threads (pipelines) exchange data. Such programs are ill-suited for execution on traditional von Neumann architecture computers. Rather, these applications can require architectures optimized for parallel and pipeline processing, such as CGRAs or graphics processing units (GPUs).

The ascent of dataflow applications such as ML and AI, and of massively parallel architectures (such as CGRAs), places new and complex requirements on executing the applications, or computations of the applications, on CGR hardware. Such requirements can include how computations of an application are pipelined, which computations are assigned to which compute units, how data is routed between various compute units and memories, and how synchronization among processors, memories, and data transfer hardware is controlled, particularly when a dataflow application includes one or more nested loops, whose execution time can vary depending on the data being processed. The architecture, configurability, and dataflow capabilities of CGR systems, and CGR components of CGR systems, enable increased compute power that supports both parallel and pipelined computation.

In implementations CGR components of a CGRS, for example, can be programmed to simultaneously execute multiple independent and interdependent operations. To enable simultaneous execution within a pipeline stage, and across pipeline stages, dataflow applications need to be distilled from a high-level program and translated to low level instructions to execute the program on hardware resources of reconfigurable dataflow systems, such as a CGRS. The low level instructions can comprise a configuration file describing a configuration of CGR components, as well as processor (e.g., CGRP) instructions and/or instructions for transferring application data among CGR components.

A high-level program is source code written in programming languages like Spatial, Python, C++, and C, and can use computation libraries for scientific computing, ML, AI, and the like. The high-level program and referenced libraries can implement computing structures and algorithms of machine learning models like AlexNet, VGG Net, GoogleNet, ResNet, ResNeXt, RCNN, YOLO, SqueezeNet, SegNet, GAN, BERT, ELMo, USE, Transformer, and Transformer-XL.

In computing applications, a compiler translates high-level programs to instructions executable by processors of a computing system. In a CGRS, a CGRS compiler can translate high-level programs not only to processor instructions, but also to executable instruction files and/or “bit files” describing configurations of CGR components to execute a dataflow application, or pipeline stages of a dataflow application. CGRS compilers require mapping application operations and data flow to CGR hardware components in both space (CGR hardware parallelism) and time (for synchronization of interdependent computations). This requirement implies that a CGRS compiler must determine which operations of a dataflow application are assigned to which of the CGR components, and how both data and, in support of the computation, control information flow among CGR components and to/from external hosts and storage. This process, known as “place and route”, is one of many new challenges posed to CGRS compilers.

FIG. 1 illustrates an example reconfigurable dataflow computing system 100 including a CGR processor 110, a host 180, and a memory 190. CGR processor 110 has a coarse-grained reconfigurable architecture (CGRA) and includes an array of CGR units 120 such as a CGR array. CGR processor 110 further includes an IO interface 138, and a memory interface 139. Array of CGR units 120 is coupled with IO interface 138 and memory interface 139 via data bus 130 which can be part of a top-level network (TLN). Host 180 communicates with IO interface 138 via system data bus 185, and memory interface 139 communicates with memory 190 via memory bus 195. Array of CGR units 120 can further include compute units and memory units that are connected with an array-level network (ALN) to provide the circuitry for execution of a computation graph or a dataflow graph that can have been derived from a high-level program with user algorithms and functions. The high-level program can include a set of procedures, such as learning or inferencing in an AI or ML system. More specifically, the high-level program can include applications, graphs, application graphs, user applications, computation graphs, control flow graphs, dataflow graphs, models, deep learning applications, deep learning neural networks, programs, program images, jobs, tasks and/or any other procedures and functions that can need serial and/or parallel processing. In some implementations, execution of the graph(s) can involve using multiple units of CGR processor 110. In some implementations, CGR processor 110 can include one or more ICs. In other implementations, a single IC can span multiple CGR processors. In further implementations, CGR processor 110 can include one or more units of array of CGR units 120.

Host 180 can be, or can include, a computer such as illustrated in the examples of Grohoski and Kumar. Host 180 can execute runtime processes, as further referenced herein, and can also be used to run computer programs, such as a CGRS compiler. In some implementations, the compiler can run on a computer that is similar to the computer described in the examples of Grohoski and Kumar, but separate from host 180.

CGR processor 110 can accomplish computational tasks by executing a configuration file (for example, a PEF file). For the purposes of this description, a configuration file corresponds to a dataflow graph, or a translation of a dataflow graph, and can further include initialization data. A compiler compiles the high-level program to provide the configuration file. In some implementations described herein, a CGR array is configured by programming one or more configuration stores with all or parts of the configuration file. A single configuration store can be at the level of the CGR processor or the CGR array, or a CGR unit can include an individual configuration store. The configuration file can include configuration data for the CGR array and CGR units in the CGR array, and link the computation graph to the CGR array. Execution of the configuration file by CGR processor 110 causes the CGR array(s) to implement the user algorithms and functions in the dataflow graph.

CGR processor 110 can be implemented on a single integrated circuit die or on a multichip module (MCM). An IC can be packaged in a single chip module or a multichip module. An MCM is an electronic package that can comprise multiple IC dies and other devices, assembled into a single module as if it were a single device. The various dies of an MCM can be mounted on a substrate, and the bare dies on the substrate are electrically coupled to the surface or to each other using, for example, wire bonding, tape bonding, or flip-chip bonding.

Many dataflow applications, such as in ML and other types of AI applications, comprise neural networks (NNs). Examples of neural networks include fully connected neural networks (FCNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), convolutional neural networks (CVNNs), graph convolutional networks (GCNs), long short-term memory (LSTM) networks, autoencoders, deep belief networks, and generative adversarial networks (GANs).

In data parallel and dataflow applications, such as CVNNs, compute functions of the application are often referred to as operators. The compute functions perform computations, such as matrix computations using tensor data of the application, to execute the higher level processes of the application (e.g., object recognition in an image, natural language phrase interpretations or prediction, etc.). A neural network processes data according to a flow of computational input (operand) and computational output (results) data through layers of operators (neurons) of the NN.

Operators of an input layer can receive stimuli (e.g., input data), and the input and other (e.g., “hidden”) layers compute particular functions (e.g., an activation or loss function), and operators of an output layer output computational results. A particular layer of a CVNN comprises operators that perform the particular function computations of that layer. Example layers, and associated operators, of NNs include rectified linear unit (ReLU) layers, fully connected layers, recurrent layers, graphical network layers, long short-term memory layers, convolutional layers, kernel layers, dropout layers, and pooling layers.

A machine learning application requires “training” within a problem space the application is designed to recognize (e.g., subjects of images, audio, or video) or predict outcomes (e.g., natural language phrase completion, future values, etc.). Training a neural network can comprise determining and/or optimizing parameters associated with computations (e.g., activation functions) of the CVNN computed by operators within layers of the NN. Weights and biases, for example, can be parameters of a weights-bias activation function of a neural network. In training such an NN, a training (data parallel/dataflow) application can compute gradients of weights and biases, such as by using a loss-function, and can optimize the weights and biases based on an optimization algorithm such as gradient descent. Executing an ML application can utilize the optimized parameters to execute functions of the application.
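The training loop described above can be sketched for the simplest possible case: a single weight and a single bias fitted by gradient descent on a mean-squared-error loss. The function `train`, the learning rate, the epoch count, and the sample data are illustrative assumptions and are not taken from the disclosure.

```python
def train(samples, lr=0.1, epochs=500):
    """Fit y = w*a + b to (a, y) samples by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of L = (1/n) * sum((w*a + b - y)**2) with respect to w and b.
        grad_w = sum(2 * (w * a + b - y) * a for a, y in samples) / n
        grad_b = sum(2 * (w * a + b - y) for a, y in samples) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data generated by y = 2a + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train(samples)
print(round(w, 3), round(b, 3))
```

In a real neural network the same update is applied to tensors of weights and biases, with gradients obtained by backpropagation rather than by a closed-form expression.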

Problem spaces of a machine learning application, and/or input of dataflow applications in general, can comprise enormous amounts of data, and can often comprise tensor data. Thus, functions of these applications (e.g., operators of neural networks) commonly involve linear algebra computations over tensor data, such as matrix multiplication, transposition, and addition. Algorithms commonly employed in dataflow applications include algorithms such as linear regression and gradient descent over tensors and/or matrices of tensors. Matrices of tensor data can comprise matrices of varying dimensions, and a variety of computing systems, including dataflow computing systems, can perform matrix computations, such as GeMM, matrix summation, matrix transposition, gradient computations, and/or backpropagation of matrix computations, to process tensors in dataflow applications such as machine learning in neural networks.

As used herein, brackets and a capital letter, such as [M], are used to refer to a matrix as a whole, while lowercase letters, such as m, are used to refer to an element, or set of elements, of a matrix [M]. For example, an expression such as (w×a) refers, herein, to a multiplication of a set of elements of matrices [W] and [A], such as elements of a row of matrix [W] multiplied by elements of a corresponding column of matrix [A]. The term “element”, in reference herein to a matrix, refers to the contents (e.g., a scalar value) of a row and column cell of the matrix.

A common computation for processing tensors in dataflow applications is a sum of products (dot product) of two matrices. The products comprise products of elements of a row of one multiplicand matrix (a “left side” matrix) multiplied by corresponding elements of a column of a second multiplicand (a “right side” matrix), where the row dimension of the left side matrix and the column dimension of the right side matrix are the same (a shared dimension). As used herein, the term “dot product” refers to a sum of two or more products of a row of a left side matrix multiplicand by a column of a right side matrix. An expression such as (Σw a) refers to a sum-product of elements w and a (e.g., a sum of products w×a for elements of a row of a matrix [W] multiplied by elements of a column of a matrix [A]). As an example, a dot product of element w₁₁ of matrix [W] multiplied by a₁₁ of matrix [A], and w₁₂ multiplied by a₂₁ of matrix [A], is [w₁₁×a₁₁+w₁₂×a₂₁].
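The dot product defined above can be sketched in plain Python as follows. The function names `dot_product` and `matmul` and the sample matrices are assumptions chosen for illustration only.

```python
def dot_product(row, column):
    """Sum of products of corresponding elements of a row and a column."""
    if len(row) != len(column):
        raise ValueError("row and column must share a dimension")
    return sum(w * a for w, a in zip(row, column))


def matmul(W, A):
    """Each result element is the dot product of a row of W and a column of A."""
    columns = list(zip(*A))  # columns of the right side matrix
    return [[dot_product(row, col) for col in columns] for row in W]


# Hypothetical 2x2 multiplicand matrices.
W = [[1, 2], [3, 4]]
A = [[5, 6], [7, 8]]
result = matmul(W, A)  # [[19, 22], [43, 50]]
```

For example, the first element of `result` is 1×5+2×7, the dot product of the first row of [W] and the first column of [A].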

A “matrix summation” computation, as used herein, refers to a matrix computation in which a dot product of two multiplicand matrices is added to a matrix addend. A matrix addend can comprise a constant or can comprise a matrix (which can itself be multiplied by a matrix multiplied by a constant) sharing a row dimension of the dot product of two multiplicand matrices. A “weight-bias function”, y=Σw a+b, is one example of such a computation, in which a weights matrix [W] is multiplied by an activation matrix [A] and the dot products, Σw a, for each row/column set of products, is added to elements of a bias matrix [B] . . . .
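As an illustrative sketch of the weight-bias function y=Σw a+b, the following Python fragment adds each element of a bias matrix to the corresponding dot product of a weights matrix and an activation matrix. The function name `weight_bias` and the sample values are assumptions, and the sketch assumes the bias matrix has the same shape as the matrix of dot products.

```python
def weight_bias(W, A, B):
    """Compute y = (sum of w * a) + b for each row of W and column of A."""
    columns = list(zip(*A))  # columns of the activation matrix
    return [
        [
            sum(w * a for w, a in zip(row, col)) + B[i][j]
            for j, col in enumerate(columns)
        ]
        for i, row in enumerate(W)
    ]


# Hypothetical weights (2x2), activations (2x1), and biases (2x1).
W = [[1, 2], [3, 4]]
A = [[5], [7]]
B = [[1], [2]]
y = weight_bias(W, A, B)  # [[20], [45]]
```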

In implementations, a CGRP, and/or other CGR components of a CGRS, can perform computations (e.g., operators) of applications in a distributed fashion and/or can execute computations as dataflow pipelines that can efficiently exploit CGRS and application parallelism, and CGR component data locality. Dataflow pipelines of CGRS compute units (e.g., CGRPs and/or CGR arrays) can contain several computational stages, in which each stage can read data from one or more input buffers (e.g., buffers in CGR component memories), can perform computations on the data while using one or more internal buffers to store and retrieve intermediate results, can produce outputs, and can write the outputs to one or more output buffers.

Data parallel and dataflow computing applications can comprise tensor computations, usually involving enormous amounts of data, such as very large and/or numerous matrices of tensor data. For example, machine learning (ML) and other tensor-based applications can comprise a convolutional neural network (NN). While not intended to limit implementations, a convolutional neural network can serve to illustrate aspects of the disclosure. However, it will be appreciated by one of ordinary skill in the art that aspects of the disclosure can apply broadly to a variety of computing applications involving tensor data, and/or executed by data parallel and/or dataflow applications and computing systems.

A CVNN can comprise layers organized as a pipeline of computations using matrices of tensor data. A layer of the CVNN can comprise operators performing computations on matrices of tensor data. A particular operator of a CVNN (or, tensor-based application in general) can perform a matrix computation, such as Generalized Matrix Multiplication (“GeMM”), matrix convolution, and Rectified Linear Units (“ReLU”) corresponding to particular algorithms and/or functions of the application, such as an activation function, gradient descent function, and/or a loss function. A particular layer of a CVNN can comprise multiple processing elements, such as CGRPs, executing in parallel to perform operator computations of the application using subsets of tensor data. The processing elements of one layer of a CVNN can output results of their computations to a successor “forward” and/or “backward” layer of the NN.

Various types and/or combinations of computing systems can execute tensor-based applications, and/or operators of tensor-based applications, such as NNs. Data parallel (DP) and dataflow computing systems, particularly systems utilizing CGRPs, can be particularly efficient at executing tensor-based applications. CGRPs can individually, or in combination, execute functions and/or computations of application operators, in parallel and in pipelines, to efficiently execute an application and improve performance of application execution. As used herein, the term “reconfigurable dataflow system (DS)” refers, interchangeably, to data parallel and dataflow computing systems utilizing reconfigurable processors such as CGRPs. A CGRS can, for example, efficiently execute tensor-based applications such as convolutional neural networks, and can serve to illustrate aspects of the disclosure without limiting implementations.

A tensor-based application can include “operators” that perform computations such as linear regression, non-linear regression, Gaussian regression, Support Vector Machine (SVM) regression, Generalized Linear Models, regression trees, shallow and deep neural network models, logistic regression, decision tree, and, “K” nearest neighbor, using matrices of tensor data. One expression, or representation, of an application is a computation graph (hereinafter, for brevity, simply “graph”), which can be textual, graphical, or a combination of textual and graphical descriptions of operators, operands, and results of computations of the application. A graph can represent the operators (as compute nodes of the graph) of an application, and their arrangement and/or dependencies (e.g., flow of computational inputs and outputs) among the operators (as edges of the graph).

Data nodes of a graph can represent particular application data elements, such as input data for training an ML model. A graph can be a directed acyclic graph (DAG), or can comprise loops, and even nested loops, of operators. As used herein, except where otherwise qualified as “data node”, the term “node” is used interchangeably to refer to an operator of an application and a node representation of that operator in a graph.

Forward nodes of a graph can receive outputs of backward nodes (e.g., gradients), and backward nodes can receive updated outputs of forward nodes (e.g., outputs computed using outputs of backward nodes), creating feedback loops within the graph. As nodes within a feedback loop recompute outputs based on the feedback, such nodes are referred to herein as “recompute nodes”.

A pipeline of an application can comprise a set of forward operators and, optionally, a set of backward operators (e.g., backpropagation operators). Each operator within a pipeline can process data output from a predecessor operator, generally in parallel with the predecessor operator as the predecessor operator outputs results of computations over a portion of input data.
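The pipelining described above can be loosely sketched with Python generators, in which a consumer stage begins processing each portion of data as soon as its predecessor produces it, rather than awaiting the complete result. The stage names and the stand-in computations are assumptions for illustration; generators interleave the stages in a single thread and do not model true hardware parallelism.

```python
def scale_stage(tiles):
    """Producer stage: a stand-in computation applied to each portion of input."""
    for tile in tiles:
        yield [2 * value for value in tile]


def relu_stage(tiles):
    """Consumer stage: processes each portion as soon as the producer emits it."""
    for tile in tiles:
        yield [max(0, value) for value in tile]


# Hypothetical input data split into two portions.
input_tiles = [[-1, 2], [3, -4]]
outputs = list(relu_stage(scale_stage(input_tiles)))  # [[0, 4], [6, 0]]
```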

FIG. 2 illustrates an example of a computation graph corresponding to an application. As shown in FIG. 2 , forward and backward operators of an application can be grouped, such as for mapping the operators to CGR components for execution, as respective forward and backward sections of a graph. The sections can each represent nodes of the graph that do not have data dependencies among each other (that is, do not need to await complete computational results of another compute node), such that a CGRS can execute computations of the nodes in a pipeline topology among CGR components. Sections can particularly comprise operators that are “adjacent” and, based on not having data dependencies among each other, can form a pipeline. A “producer” operator and a “consumer” operator are adjacent operators in a graph if the producer can input results of the producer operator computations to an operand of the consumer operator. For example, a GeMM operator is adjacent to an ADD operator if a results matrix (or elements of a results matrix) of the GeMM operator can be direct input to the ADD operator.

In FIG. 2, forward sections 210 is shown comprising Pipe 214A and Pipe 214B, and backward sections 220 is shown comprising Pipe 224A and Pipe 224B. Pipe 214A is shown comprising node CONV 212A, and Pipe 214B is shown comprising nodes RELU 212B, CONV 212C, RELU 212D, and MAXPOOL 212E (hereinafter, collectively “nodes 212”). Names of nodes, such as “RELU”, can indicate a type of computation of the application performed by a node.

Edges of a graph can represent data flow between and into or out of the nodes. Thus, computational results of node CONV 212A can flow as inputs to node RELU 212B, computational results of node RELU 212B can flow as inputs to node CONV 212C, and so forth. Data nodes in a graph can represent data processed by compute nodes and flow of data into or out of the nodes (as also shown in FIG. 2 by directed arrows). In forward sections 210, FIG. 2 depicts data nodes OP DATA 202 and WEIGHT 204 as data input to CONV 212A, and WEIGHT 206 as data input to CONV 212C.
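As a hedged illustration, the forward nodes and edges described above can be represented as a mapping from each node to its predecessors, and a valid execution order can be derived with the Python standard library `graphlib` module. The edges into RELU 212D and MAXPOOL 212E are assumed from the phrase “and so forth”; the dictionary layout is an illustrative choice, not a representation taken from the disclosure.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes whose outputs it consumes (its predecessors).
forward_graph = {
    "CONV 212A": {"OP DATA 202", "WEIGHT 204"},
    "RELU 212B": {"CONV 212A"},
    "CONV 212C": {"RELU 212B", "WEIGHT 206"},
    "RELU 212D": {"CONV 212C"},     # assumed from "and so forth"
    "MAXPOOL 212E": {"RELU 212D"},  # assumed from "and so forth"
}

# A valid execution order places every producer before its consumers.
order = list(TopologicalSorter(forward_graph).static_order())
position = {node: index for index, node in enumerate(order)}
```

Here `order` contains the three data nodes and five compute nodes, with each producer positioned ahead of its consumers.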

In FIG. 2, backward sections 220 is shown comprising Pipe 224A and Pipe 224B. Pipe 224A is shown comprising nodes CONV2D BWD 222A and RELU BWD 222B, and Pipe 224B is shown comprising nodes CONV2D BWD 222C, RELU BWD 222D, and MAXPOOL 222E. In backward sections 220, FIG. 2 depicts data node WEIGHT 206 as data input also to CONV2D BWD 222C. Backward nodes of a graph can represent nodes that receive outputs of forward nodes and compute a feedback function over those outputs. For example, a common backward computation is to compute gradients of weights and biases, and/or loss functions based on gradients of weights and biases, in a weights-bias activation function of a forward node. Backward nodes can compute, for example, a gradient in an application that includes gradient descent to optimize computations of forward nodes in a feedback loop. As shown, an output of backward sections 220 is data node output gradient 208, output by node CONV2D BWD 222A.

In implementations, a “CGRS compiler” can compile a high-level language representation of a data parallel and/or dataflow application to configurations and/or execution instructions to execute the application. For brevity, hereinafter “application” is understood to refer to a data parallel or dataflow programming application for execution by a data parallel and/or dataflow computing system, such as a CGRS.

A CGRS compiler can, for example, transform an application into, and/or can utilize, a graph such as example graph 200 in FIG. 2 . Based on a graph of an application, a CGRS compiler can generate a search space, and can use the graph and/or search space to determine model operational parallelism and pipelining, and/or to map model dataflow (e.g., nodes and edges of a computation graph) to CGRS and/or CGR hardware resources and dataflow through the resources. A compiler can further transform resource mapping decisions into assembler input for generation of hardware instructions and/or hardware configuration files, such as a Processor Executable Format (PEF) file.

FIG. 3 is a block diagram of example compiler stack 300 comprising multiple compilation stages to compile a dataflow application for execution by a CGRS. As depicted in FIG. 3 , compiler stack 300 includes several stages to translate a high-level program, with (user) dataflow application algorithms and functions (e.g., ML algorithms and/or tensor computation functions), to configuration and/or instruction data for a CGRS to execute the application.

Compiler stack 300 can take its input from application platform 310, and/or any other source of high-level program statements of an application, which provides a user interface, such as an API and/or command line interface (CLI), for application developers to compile an application. A “user”, as used herein, can be any human or computing system that develops an application (e.g., programs the high-level programs of an application), and/or that can input an application into a CGRS compiler for translation to CGRS configurations and/or CGRS execution instructions.

Compiler stack 300 can further receive hardware description 315, which can comprise a textual and/or graphical description of CGRS and/or CGR hardware components of a CGRS. Compiler stack 300 can utilize hardware description 315 to translate the high-level programming statements of an application to configurations of CGR components and/or execution instructions (e.g., instructions to a runtime processor to control execution, and/or processor instructions to execute functions, of an application) to execute the application.

Application platform 310 can comprise a computing system for developing an application and/or inputting an application for compilation by a CGRS compiler. For example, application platform 310 can comprise a computing system capable of hosting a user, such as a host processor in the CGRS examples of Kumar. Application platform 310 can include libraries such as PyTorch, TensorFlow, ONNX, Caffe, and Keras to provide user-selected and configured algorithms.

Application platform 310 can output a high-level program of an application to compiler 320, which in turn can output a configuration file to runtime processes 330. Runtime processes 330 can comprise programs to configure CGR components, and/or manage execution of an application on CGR components, of a CGRS. The programs can execute on a runtime processor (e.g., one or more CPUs) of a CGRS.

Compiler 320 can include dataflow graph compiler 321, algebraic graph compiler 322, template graph compiler 323, template library 324, and placer and router PNR 325. In implementations, template library 324 can include a reconfigurable unit abstract intermediate language (RAIL), and/or assembly language interfaces (APIs) for power users.

Dataflow graph compiler 321 can analyze high-level programs, implementing user algorithms and application functions received from application platform 310, and can convert the high-level programs to one or more dataflow graphs. The high-level programs can be suitable for parallel and/or pipeline processing and nodes of the dataflow graphs can be intrinsically parallel unless an edge in the graph indicates a dependency. Dataflow graph compiler 321 can provide code optimization steps, such as false data dependency elimination, dead-code elimination, and numeric constant folding. The dataflow graphs can encode data and execution control dependencies of the high-level programs.
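As one illustration of the optimization steps named above, numeric constant folding can be sketched over a small expression tree. The tuple-based representation, the operator names `add` and `mul`, and the function name `fold_constants` are assumptions for illustration and do not reflect the internal representation of dataflow graph compiler 321.

```python
def fold_constants(expr):
    """Replace operator nodes whose operands are all numeric constants with a value."""
    if not isinstance(expr, tuple):
        return expr  # a numeric constant or a named input (string)
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        if op == "add":
            return left + right
        if op == "mul":
            return left * right
        raise ValueError(f"unsupported operator: {op}")
    return (op, left, right)


# x * (2 + 3): the constant sub-expression folds to 5 at compile time.
expr = ("mul", "x", ("add", 2, 3))
folded = fold_constants(expr)  # ("mul", "x", 5)
```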

Dataflow graph compiler 321 can support programming CGR components (e.g., CGRPs) using higher or lower-level programming languages. For example, dataflow graph compiler 321 can support translation or conversion from application platform 310 to C++ and/or an assembly language. In implementations, dataflow graph compiler 321 can allow programmers to provide code (e.g., machine language code) that runs directly on CGRPs and/or other CGR components. Dataflow graph compiler 321 can include one or more programming libraries, and the libraries can include predefined functions, such as linear algebra operations, element-wise tensor operations, non-linear functions, and reduction functions for creating, executing, and profiling dataflow graphs on the CGRPs. Via the application platform 310, dataflow graph compiler 321 can provide an API to enhance programming functionality available to application developers.

Algebraic graph compiler 322 can include a Model Analyzer and Compiler (MAC) level that can make high-level mapping decisions for sub-graphs (also referred to as “sections” or “section cuts”) of a dataflow graph based on CGR hardware constraints. Algebraic graph compiler 322 can support various application frontends, such as Samba, JAX, and TensorFlow/HLO. Algebraic graph compiler 322 can also transform the graphs, for example via autodiff and GradNorm, to perform stitching between sub-graphs, interface with template generators for performance and latency estimation, convert dataflow graph operations to algebraic intermediate representation (AIR) operations, perform tiling, sharding (database partitioning) and other application preparation operations, and can model or estimate execution parallelism that can be achieved within the dataflow graphs.

Algebraic graph compiler 322 can include an arithmetic or algebraic intermediate representation (AIR) level that can translate high-level dataflow graphs and mapping decisions provided by a MAC level into AIR graphs. An AIR level can include validating and/or correcting (“legalizing”) a dataflow graph and/or mapping decisions of a MAC; expanding data parallel, tiling, pipeline, and/or region instructions provided by a MAC; inserting stage buffers and skip buffers; eliminating redundant operations, buffers, and sections; and optimizing resource use, execution latencies, and computational throughput.

Template graph compiler 323 can translate AIR graphs to a template library intermediate representation (TLIR). A TLIR can comprise a graph that can optimize configurations and/or execution instructions based on target (CGRS and/or CGR) hardware architecture and/or to unplaced units suitable for place, allocate, and route level PNR 325. Template graph compiler 323 can add further information (e.g., node names, node inputs, node input names, and dataflow descriptions) as inputs to PNR 325, and can make the graph physically realizable through each layer of the graph. Template graph compiler 323 can, for example, translate AIR graphs to specific application operation templates, such as templates for general matrix multiplication (GeMM), matrix transposition, and/or matrix convolution operations. In implementations, a CGRS compiler like compiler 320 can convert part or all intermediate representation operations to templates, stitch templates into data and control flow of the application, insert necessary buffers and layout transforms, generate test data, and optimize for CGR hardware utilization, execution latency, and compute and/or data transfer throughput.

Implementations can use templates for common operations. Templates can be implemented using assembly language, RAIL, or similar language and/or representation constructs. RAIL can compare to a low-level language, in that memory units and compute units can be separately programmed in RAIL constructs, but RAIL can provide a higher level of abstraction and compiler intelligence than, for example, an assembly language, via a concise performance-oriented and domain-specific language for CGR component (e.g., CGR array) templates. RAIL can enable template writers and external power users to control interactions between logical compute units and memory units of CGR components using high-level expressions, without the need to manually program actions such as capacity splitting, register allocation, etc. RAIL logical compute and memory units can also enable stage/register allocation, context splitting, transpose slotting, resource virtualization and mapping to multiple physical compute units and memory units (e.g., PCUs and PMUs of tiles, such as in the examples of Grohoski and Kumar).

Template library 324 can include an assembler that provides an architecture-independent, low-level programming interface as well as optimization and code generation for CGR hardware. An assembler can include memory address expression compilation, CGR hardware intra-unit resource allocation and management, rendering a template graph physically realizable based on CGR hardware-specific rules, low-level CGR hardware architecture-specific transformations and optimizations, and CGR hardware architecture-specific code generation.

PNR 325 can translate RAIL and/or assembly language outputs of template library 324, and/or TLIR outputs from template graph compiler 323, and can map logical (e.g., unplaced physically realizable) CGR units to physical CGR hardware implementation levels, such as an SCM, MCM, and/or chip level of CGR components. PNR 325 can determine physical data channels to allow for communication among the CGR units and between the CGR components (e.g., components coupled via a TLN), allocate memory, I/O, and/or switch ports of CGR components, provide CGR component configuration data and initialization data, and can produce configuration files, e.g., processor-executable format (PEF) files. PNR 325 can provide bandwidth calculations, allocate network interfaces, provide configuration data for CGR components to perform memory address translation, and control switch and data routing among CGR components. PNR 325 can perform such functions in multiple steps and can include multiple modules (not shown in FIG. 3) to perform the multiple steps (e.g., a placer, a router, a port allocator, and a PEF file generator). PNR 325 can receive input data, for example, from any of the higher-level modules (dataflow graph compiler 321, algebraic graph compiler 322, template graph compiler 323, and/or template library 324). In implementations, a higher-level module, such as template graph compiler 323, can prepare information for PNR 325 and can omit other levels directly providing input data to PNR 325.

Implementations of compiler 320 compile applications in an iterative process, such as feeding information from PNR 325 back to a higher-level module, which can, in turn, execute a new compilation step using physically realized results, rather than estimates of, or logical placeholders for, physically realizable circuits. For example, PNR 325 can feed information regarding the physically realized circuits back to algebraic graph compiler 322.

Memory allocations can represent logical memory spaces in on-chip (a chip implementing a CGR component) and/or off-chip (a chip separate from a CGR component), CGR component memories, for data flowing through the dataflow graph; a configuration file, such as a PEF, can specify particular memory allocations. Memory allocations can define a type and number of CGR hardware memories and/or circuits (functional units, storage, or connectivity components). Main memories (e.g., DRAM) can be, for example, off-chip memories, and scratchpad memories (e.g., SRAM) can be on-chip memories, such as memories of a CGR array. Memory allocations can correspond to various access patterns and/or memory layouts, such as access patterns/layout of cache memories, read-only look-up tables (LUTs), serial memories (e.g., FIFOs), and/or register files.

Compiler 320 can bind memory allocations to unplaced memory units and can bind operations of a dataflow graph to unplaced compute units, for execution of a graph, and configuration data, such as in a PEF, can specify such bindings. In implementations, compiler 320 can partition parts of a dataflow graph into memory subgraphs and compute subgraphs, and can specify these subgraphs in a configuration file. A memory subgraph can comprise, for example, address calculations leading up to a memory access. A compute subgraph can comprise, for example, compute operations (compute nodes) in a parent graph. A compiler can divide a parent graph into multiple memory subgraphs and a single compute subgraph, for example. A single parent graph can produce one or more memory subgraphs, depending on how many memory accesses exist in the original graph loop body. In cases where the same memory addressing logic is shared across multiple memory accesses, a compiler can duplicate address calculations to create multiple memory subgraphs from the same parent graph.

Compiler 320 can generate configuration files with configuration data (e.g., a bit stream) for the placed positions, and for routed data and control networks. In implementations, this can include the compiler assigning coordinates and communication resources of the physical CGR components by placing and routing unplaced units of CGR components with a goal to maximize compute and/or data transfer bandwidth and minimize compute and/or data transfer latency.

An application may not itself include backward nodes and, in implementations, a CGRS compiler, such as illustrated by the example of compiler 320, can determine that a model requires backward nodes, and can generate backward nodes in a computation graph. In determining a mapping of an application to CGR hardware resources, a CGRS compiler can identify recompute nodes and can determine section boundaries among forward nodes, backward nodes, and recompute nodes within a graph.

To exploit the full power of a CGRS, particularly dynamically reconfigurable CGR components of a CGRS, a CGRS compiler must not only generate low level processor instruction sequences, but must also allocate reconfigurable resources of the underlying CGR hardware that can execute the application most efficiently, and with highest possible computational performance. A CGRS compiler must, further, determine controls to sequence transfer in (e.g., to a memory and/or compute unit), processing (e.g., compute unit and/or operator pipelining), and/or transfer out (e.g., from a memory and/or compute unit) of application data.

In optimizing parallelization and computational latency among CGRS hardware resources, a CGRS compiler must consider complex factors, such as: the number of available processing units (e.g., processors of CGR components); the number, size, and transfer latency of memory units (e.g., memories of CGR components); computational latency of operators of the application; dependencies among operators; and, sections of an application that can execute in parallel, not only intrinsically, but also given the amount of CGRS hardware resources available to execute the sections. Such considerations can be referred to as “mapping factors”.

In implementations, a “mapping decision space” can comprise mapping factors. In addition, or as an alternative, to the factors just described, the mapping factors can include parameters and/or attributes of an application and/or CGRS related to mapping factors. Mapping factors included in a mapping decision space can include, for example, descriptions and/or attributes of CGR components; configurations and/or arrangements of data nodes, compute nodes, and interconnections of nodes (edges) of a graph and CGR components; and/or, groupings (“section cuts”) of operators of a graph into particular pipelines and sections. Mapping factors of a mapping decision space can include alternative such configurations and section cuts, and can include costs (e.g., hardware utilization, compute and/or data transfer bandwidth or latency) associated with the alternatives. Mapping factors of a mapping decision space can include optimization goals (e.g., optimizing utilization over latency, or vice versa) and/or priorities of execution of particular nodes of a graph.

Mapping decisions can comprise tiling alternatives to apply to operands/results matrices, alternative groupings of operators within pipelines and/or sections, and “PAR” (parallelization) factors associated with parallel execution of operators among alternative pipelines and/or section cuts. Mapping decisions can comprise, or be based upon, performance characteristics of mapping alternatives, such as computational latencies and/or CGRS hardware utilizations associated with different mapping decisions. Mapping decisions can include pipeline, tiling, and/or section cut options that can optimize particular performance characteristics (e.g., mapping decisions that can be preferred to optimize a particular performance characteristic of executing the application on CGRS hardware).
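A minimal sketch, under stated assumptions, of selecting among mapping alternatives by a performance characteristic follows. The class `MappingAlternative`, the toy cost model in `estimate`, and all numeric values are hypothetical; the sketch enumerates alternatives exhaustively for clarity and does not represent the MI model or MI solver of the disclosure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MappingAlternative:
    par_factor: int     # degree of parallel execution of an operator
    num_tiles: int      # number of tiles an operand matrix is sliced into
    latency: float      # estimated computational latency (arbitrary units)
    units: int          # estimated hardware units consumed
    tile_elements: int  # elements per tile, which must fit a memory buffer


def estimate(par_factor, num_tiles, work=1024.0, elements=4096, tile_overhead=4.0):
    """Toy cost model (an assumption): parallelism divides work, tiles add overhead."""
    latency = work / par_factor + tile_overhead * num_tiles
    units = 2 * par_factor  # assumed: two hardware units per parallel lane
    return MappingAlternative(par_factor, num_tiles, latency, units,
                              elements // num_tiles)


def choose_mapping(par_factors, tile_counts, available_units, buffer_capacity):
    """Pick the lowest-latency alternative that fits the hardware limits."""
    candidates = [estimate(p, t) for p, t in product(par_factors, tile_counts)]
    feasible = [c for c in candidates
                if c.units <= available_units and c.tile_elements <= buffer_capacity]
    if not feasible:
        raise ValueError("no mapping alternative fits the available hardware")
    return min(feasible, key=lambda c: c.latency)


best = choose_mapping(par_factors=[1, 2, 4, 8], tile_counts=[1, 2, 4],
                      available_units=8, buffer_capacity=2048)
```

With these assumed values, the selected alternative uses a PAR factor of 4 and two tiles: a PAR factor of 8 exceeds the available units, and an untiled operand exceeds the buffer capacity.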

FIG. 4A illustrates mapping factors and a mapping decision space a CGRS compiler can utilize in mapping operators and data of an application to underlying hardware resources of a CGRS (e.g., CGR components of a CGRS). A MAC component of a CGRS compiler, for example, can generate and/or analyze a computation graph of an application to determine mapping factors of a mapping decision space. For example, a MAC can traverse a graph, such as in the example of FIG. 2 , to determine mapping factors of a mapping decision space.

In implementations, a compiler can determine a mapping of an application (e.g., operators and tensors included in a graph of an application) to CGR hardware resources for execution of the application. A compiler, or a MAC of a compiler, can include a hardware mapping component—referred to herein as a “mapper”—and the mapper can analyze a graph to map operators, tensors, and/or tensor dataflow of an application to CGR hardware for execution.

For purposes of illustrating the disclosure, example operations of the disclosure, such as example operations of FIG. 4A, are frequently described as performed by a MAC, and/or components of a MAC, of a CGRS compiler. However, this is not intended to limit implementations, and one of ordinary skill in the art will appreciate that a compiler need not necessarily comprise a CGRS compiler, a MAC of a CGRS compiler, and/or particular components (e.g., a mapper) of a compiler or a MAC to perform methods, and/or steps of methods, of the disclosure. Components of a compiler alternative to these particular components can perform methods and operations of the disclosure within the scope and spirit of the disclosure.

In FIG. 4A, decision space 400 is an example of a mapping decision space that a CGRS compiler can utilize to determine alternatives to map an application to CGR hardware for a CGRS to execute the application efficiently. Decision space 400 can represent a combination (not necessarily exhaustive) of mapping factors 402-412 (collectively, “mapping factors 400” in FIG. 4A) that a CGRS compiler can include in a mapping decision space such as example decision space 400.

In FIG. 4A, app 418 can comprise an application, and/or application model, (e.g., represented as a graph and/or textual representation) and MAC 416, in FIG. 4A, can be a MAC component of a CGRS compiler configured to compile app 418. MAC 416 can generate decision space 400 to execute app 418 on CGR hardware that can be represented by hardware attributes 414. In the example of decision space 400, mapping factors 400 are shown in FIG. 4A including PAR factors 402, tiling factors 404, model/data parallelism 406, stage boundaries 408, recompute sections 410, and section/HW boundaries 412.

PAR factors 402 can comprise, for example, parallelization (“PAR”) factors included in a template (e.g., a template among template library 324 in FIG. 3 ) that can represent an intrinsic, or application programmer preferred, parallelization of model operators. Tiling factors 404 in decision space 400 can include alternative, and/or optimal, tiling of operator and/or pipeline input data, operand matrices, and/or operator results matrices. Tiling a graph refers to partitioning, or “slicing”, operand/results matrices of tensor data input to, and output from, operators in the graph into smaller matrices (“tiles”). As used herein, the term “hardware tile” refers to a tile such as described in Grohoski and Kumar, comprising an array of compute (PCU) and/or memory (PMU) units. In contrast, the term “tile”, used without the qualifier “hardware”, refers to a partition of a larger matrix, such as an M×K/2 tile formed by slicing an M×K matrix into two M×K/2 tiles. Similarly, the term “tiling”, as used herein, refers to partitioning a matrix into smaller tiles.
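The tiling described above, such as slicing an M×K matrix into two M×K/2 tiles, can be sketched as follows. The function name `tile_columns` and the sample matrix are assumptions for illustration.

```python
def tile_columns(matrix, num_tiles):
    """Slice an M x K matrix into num_tiles tiles of shape M x (K / num_tiles)."""
    k = len(matrix[0])
    if k % num_tiles != 0:
        raise ValueError("column dimension must divide evenly among tiles")
    width = k // num_tiles
    return [
        [row[t * width:(t + 1) * width] for row in matrix]
        for t in range(num_tiles)
    ]


# Hypothetical 2 x 4 matrix sliced into two 2 x 2 tiles.
matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
tiles = tile_columns(matrix, 2)  # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```

Each tile preserves the row dimension M of the original matrix, consistent with tiling that preserves a particular dimension.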

A MAC can tile the matrices based on, and/or to preserve, a particular, shared dimension of the matrices (e.g., a row dimension or a column dimension of the matrices). Model/data parallelism 406 can include boundaries of operator and data parallelism, which can represent, for example, a degree of parallelization of model operators and data. Stage boundaries 408 can include, for example, boundaries of pipeline stages of underlying CGRS and/or CGR component hardware.

As illustrated in the examples of FIG. 2, a model can comprise sections. Operators that cannot be executed in parallel (e.g., operators that cannot be included in a pipeline with another operator) cannot be included in the same section of an application. Similarly, underlying CGR hardware can have limits to the number and/or type of operators that it can perform in parallel, and/or the amount of data it can process (e.g., based on sizes of memory to buffer or store input data and/or computation outputs). Thus, section/HW boundaries 412 can include boundaries, within a model or graph of a model, between forward and backward sections of the model, and/or boundaries of CGR hardware to execute operators within particular sections of a graph. Hardware boundaries among section/HW boundaries 412 can be based on a hardware description, and/or attributes of hardware, of CGR hardware, such as can be included in hardware attributes 414.

Backward nodes can be feedback paths, in the model, to recompute nodes, and the recompute nodes can be factors of decision space 400, such as to determine dependencies among sections and operators within sections. Recompute sections 410, for example, can represent combinations of operators that recompute particular application functions, such as recomputing activation functions using results (e.g., gradient adjusted matrices) of backward section operators.

In implementations, a compiler can represent an application, and/or a graph, using high level language (HL), intermediate level (IL), and/or low level (LL) language constructs and/or statements that can represent operators, operands/results of operators, and/or interconnections of the nodes and/or allocation of CGR hardware to execute the application. HL, IL, and/or LL representations can be, or can represent, an application graph or model. HL, IL, and LL language constructs/statements can describe nodes and edges of a graph, and/or instructions for executing the graph (i.e., executing the application as represented by the graph) on CGR hardware. HL, IL, and/or LL language constructs and/or statements can include compiler generated mapping alternatives and/or decisions as to how to map the application to CGR hardware for execution.

A compiler can generate a high level graph representation (“HLR”) of an application. The compiler can utilize an HLR, for example, to analyze overall execution elements of the application, and/or to determine initial alternatives for mapping operations of the application to CGR hardware, such as tiling, section cut, and/or parallelization factors in mapping the application.

A compiler can generate, for example, an IL representation (ILR) of the graph that can incorporate mapping alternatives and/or decisions. For example, a compiler can translate an HL graph into an ILR such as an AIR graph and/or a TLIR graph. A compiler can compile, or translate, an ILR to an LL representation (LLR), such as a RAIL representation, that can describe configuration and/or execution instructions to execute the application using particular CGR hardware and/or configurations. The LLR can be suitable for generating application execution code specific to the CGR hardware, such as a PEF and/or configuration files. An ILR and/or LLR can be textual and/or graphical, and can be another form of an application, or subset of an application.

A compiler can analyze graphs to determine execution parameters corresponding to CGR hardware allocated to execute the application. For example, a compiler can analyze an ILR (e.g., AIR) or LLR (e.g., RAIL) to determine execution latencies, processor/memory utilizations, and various other such metrics of application execution based on an IL or LL graph that includes CGR hardware resource allocations and/or execution on CGR hardware.

FIG. 4B illustrates example MAC 420, which can provide functions of a MAC such as MAC 416 in FIG. 4A. FIG. 4B depicts MAC 420 comprising MAC front end 422, HL optimizer 424, mapper 426, IR out 430, and estimator 428. In implementations, MAC front end 422 can comprise, for example, an API to input an application and/or application programming statements to compile for execution by a CGRS, shown in FIG. 4B as app 440. MAC front end 422 can comprise interfaces and/or functions to access hardware descriptions of the CGRS, to access or interact with other components of a compiler that includes MAC 420, and/or to access or interact with components of a host processor and/or the CGRS. MAC front end 422 can convert an application or application model, such as app 440, to a graph and/or an intermediate representation (IR), for MAC 420 to determine mapping decisions to execute app 440.

HL optimizer 424 can perform high level optimization of app 440 and/or a graph of app 440, such as fusing operators (nodes) of a graph into higher level operators, eliminating no-ops and/or redundancies within app 440, and/or computing derivatives (e.g., Autodiff). In implementations, a compiler can determine a mapping of an application (e.g., operators and tensors included in a graph of an application) to CGR hardware resources for execution of the application. Mapper 426 can be a mapper component or function of MAC 420 that can determine mapping decisions to include in a mapping decision space, such as tiling, section cut, and/or parallelization decisions for mapping app 440 to CGR hardware for executing app 440.

Mapper 426 can utilize estimator 428 to determine, for example, model execution metrics such as computational latencies of CGRPs executing operators of app 440, data transfer latencies among memories of CGR hardware (e.g., memories of CGRPs executing operators of app 440), computational throughput among CGRPs executing operators of app 440, and/or amounts of memory required for operands/results tensor data of operators of app 440. Mapper 426 can output mapping decisions to IR out 430 and IR out 430 can translate, or otherwise convert, the mapping decisions to an intermediate representation of app 440 that includes mapping decisions to execute app 440 on the CGR hardware.

As neural networks form the basis of many dataflow applications, neural networks can represent useful applications to illustrate the disclosure, and examples and descriptions of the disclosure make frequent reference to NNs as an example application. However, this is not intended to limit implementations and one of ordinary skill in the art will appreciate that the scope and spirit of the disclosure, and the methods and/or structures of the disclosure, can encompass user applications suitable for execution on CGR systems other than NNs.

In implementations, a MAC can analyze an application (e.g., a graph of the model) to determine mapping factors included in a mapping decision space, such as mapping factors in decision space 400 of FIG. 4A. A MAC can analyze an application or graph to determine operators that can form pipelines, and alternative pipelines, and associated sections including the pipelines, and can include the pipelines in a decision space (e.g., among section/HW boundaries 412 of decision space 400 in FIG. 4A).

Pipelines of operators can be possible along certain dimensions and/or tile sizes of operands/results tensors, and may not be possible among others. For example, a producer GEMM and consumer ADD operator can form a pipeline over dimension M of an M×N GEMM output tensor and M×J ADD addend tensor. Sizes of GEMM output and/or addend input tensor tiles can determine whether or not CGR hardware (e.g., a particular CGRP) can process a tensor. A CGRP executing the GEMM operator, for example, can require a tile size (e.g., to fit in a CGRP memory) different from that of a CGRP executing the ADD operator (or, vice versa), such that the two operators can form a pipeline only using tile sizes suitable for each of the two CGRPs. Tiling the GEMM output tensor and the ADD addend tensor on dimension M (e.g., forming two tiles of each tensor to have dimension M/2) can permit, or be necessary, to enable the CGR hardware to execute the ADD operation as a pipeline with the GEMM operator. Thus, tiling decisions can determine if successive operators in a graph can form a pipeline. In evaluating alternative tiling decisions, a mapper can evaluate dimensions and/or sizes of producer/consumer operator results/operands matrices to determine pipelines that can be formed based on those dimensions/sizes.
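To illustrate, and not intended to limit implementations, the following sketch searches for the smallest number of equal tiles along shared dimension M such that both a GEMM output tile and an ADD addend tile fit in the memory of their respective CGRPs. The function name, the capacities expressed as element counts, and the restriction to equal-sized tiles are assumptions made for illustration only.

```python
def smallest_tile_count(m, n, j, gemm_capacity, add_capacity):
    """Return the smallest number of equal tiles along M for which an
    (M/t x N) GEMM output tile and an (M/t x J) ADD addend tile each fit
    the capacity (in elements) of the unit that holds them.
    Returns None if no equal split along M fits."""
    for t in range(1, m + 1):
        if m % t != 0:
            continue  # only consider splits that divide M evenly
        rows = m // t
        if rows * n <= gemm_capacity and rows * j <= add_capacity:
            return t
    return None


# M=8, N=4, J=4: an untiled 8x4 GEMM output (32 elements) exceeds a
# hypothetical 16-element capacity, but two 4x4 tiles fit.
tile_count = smallest_tile_count(8, 4, 4, gemm_capacity=16, add_capacity=32)
no_fit = smallest_tile_count(8, 4, 4, gemm_capacity=3, add_capacity=32)
```

In this example the two operators can form a pipeline only when the tensors are tiled on M into at least two tiles, which mirrors the M/2 tiling discussed above.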

A MAC can perform multiple decision passes over a graph (or, elements of a graph), search space, and/or mapping decision space to determine mapping decisions. For example, a MAC can make a tiling pass, to determine tiling decisions that can apply to the operands/results matrices. A mapper can perform a section mapping pass to determine pipelines and groupings of operators into sections. The mapper can use results of a section mapping pass to make section cut decisions included in a mapping decisions space. A mapper can perform a parallelization (“PAR”) pass, based on results of the tiling and/or section mapping passes, to determine parallelization alternatives for executing operators of section cut alternatives on particular CGRS hardware.

FIG. 5 illustrates example method 500 for a mapper to perform multiple decision passes to determine mapping decisions. The method is described as performed by a MAC component of a CGRS compiler to determine mapping decisions such as previously described. However, this is only to illustrate the disclosure and not intended to limit implementations. It would be appreciated by one of ordinary skill in the art that a compiler need not necessarily comprise a MAC to perform the method or operations of the method. It would be further appreciated by one of ordinary skill in the art that a compiler can analyze a graph in manners alternative to, or inclusive of, the example of method 500, and that any particular component, or combination of components, of a compiler, or components of a computing system alternative to a compiler, can perform the method, and/or steps thereof.

In step 502 of method 500, the MAC generates (or, alternatively, receives) a graph (hereinafter, in reference to method 500, “the graph”) corresponding to an application. The graph can comprise operators and operands/results matrices of the operators, and their arrangement, dependencies, and data flow among the operators, such as previously described. The graph can comprise an initial graph of an application and/or an auxiliary graph generated by the compiler based on an initial graph of an application.

In step 504 the MAC can, optionally, generate a search space that can include operators, operands/results matrices of the operators, and/or attributes of operators and/or operands/results matrices (e.g., dimensions, operator types, connection topologies, etc.). The MAC can perform steps among steps 506-510 to perform multiple decision passes associated with the graph. In each of steps 506-510, the MAC can traverse the graph and, optionally, query a search space, to determine attributes of the application operators, operands, and/or results to further determine mapping decisions. The MAC can traverse the graph in a variety of alternative traversal orders, such as depth-first or breadth first topological orders, or combinations of these. The MAC can traverse the graph recursively within a topological order.
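As one possible illustration of a breadth-first topological traversal, the following sketch orders the operators of a small graph so that every producer is visited before its consumers. The operator names and the function `breadth_first_topological_order` are hypothetical, and the sketch uses a standard queue-based (Kahn-style) ordering rather than any particular traversal of the disclosure.

```python
from collections import deque


def breadth_first_topological_order(nodes, edges):
    """Return nodes ordered so each producer precedes its consumers.

    nodes: list of operator names.
    edges: list of (producer, consumer) pairs.
    """
    indegree = {n: 0 for n in nodes}
    consumers = {n: [] for n in nodes}
    for producer, consumer in edges:
        consumers[producer].append(consumer)
        indegree[consumer] += 1

    # Start from operators with no producers and expand level by level.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for c in consumers[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


nodes = ["gemm", "add", "relu", "softmax"]
edges = [("gemm", "add"), ("add", "relu"), ("gemm", "softmax")]
order = breadth_first_topological_order(nodes, edges)
```

A decision pass can then visit operators in `order`, so that decisions for a producer are available when its consumers are evaluated.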

In step 506 the MAC determines tiling decisions to slice operand/results matrices of the application. In implementations a tiling decision can comprise a dimension on which to slice a producer results matrix and/or consumer operand matrix such that the producer and consumer can form a pipeline. As in the previous example of an M×K results matrix of a producer operator and a K×N operand matrix of a consumer operator, tiling the matrices on dimension K can be a component of a tiling decision.

Additionally, a tiling decision can comprise a size and/or number of slices of a results and/or operand matrix. Using the same example of M×K and K×N results/operands matrices, a mapper can determine (for example, for reasons to be discussed further on) to slice the M×K results matrix into some number, adding to a total of M, of smaller matrices having column dimension K. Alternatively, or additionally, a mapper can determine to slice the K×N operand matrix into some number, adding to a total of N, of smaller matrices having row dimension K.

Tiling decisions can include tiling results matrices output from a producer pipeline and/or operands input (e.g., results input as operands) to a consumer pipeline. A mapper can determine a tiling decision of a pipeline such that the tiling decision includes tiling decisions for nested (inner or child) pipelines.

One way to refer to a matrix, and tiles of matrices in particular, is to refer to a “shape” of the matrix. The shape of a matrix can be defined as the number of elements in each dimension of the matrix, sometimes represented as a tuple representing each dimension. To illustrate further, the shape of an M×K matrix can be said to be “M,K”, and the shape of a K×N matrix can be said to be “K,N”. A tiling decision can comprise, for example, an identity of a dimension on which to pipeline a producer and consumer matrix, and one or more shapes of matrices for different tiling alternatives (e.g., tiling a M×K matrix into two M/2×K matrices).
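A tiling decision of this kind can be represented, for example, as a small record holding the preserved (shared) dimension and the shapes of the resulting tiles. The following is a minimal sketch under that assumption; the class `TilingDecision`, its field names, and the helper `row_tilings` are hypothetical and are not structures defined by the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TilingDecision:
    shared_dim: str       # dimension preserved in every tile, e.g. "K"
    tile_shapes: tuple    # one (rows, columns) shape tuple per tile


def row_tilings(m, k, max_tiles):
    """Enumerate alternatives that slice an M x K matrix into t equal
    tiles of shape (M/t, K), for every t up to max_tiles that divides M."""
    return [TilingDecision("K", ((m // t, k),) * t)
            for t in range(1, max_tiles + 1)
            if m % t == 0]


# Alternatives for a 4x3 matrix: one 4x3 tile, two 2x3 tiles, four 1x3 tiles.
alternatives = row_tilings(4, 3, max_tiles=4)
```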

In step 508 the MAC determines section groupings (section cuts) of the operators of the graph. The MAC can determine section cuts based on, for example, tiling decisions determined in step 506, and/or relationships among operators of the graph, such as data flow relationships, and/or types of operators among operators of the graph. In step 508 the MAC can query a search space to determine operators that can be combined into particular sections (section cuts) that group operators to form a pipeline and/or pipeline of pipelines.

In step 510 the MAC determines PAR factors associated with tiling alternatives determined in step 506 and/or section cuts determined in step 508. The MAC can, in step 510, determine PAR factors based on, for example, performance characteristics of the decisions as executed by particular hardware components of a CGRS. In step 510 the MAC can determine the PAR factors based on a hardware description of CGRS hardware resources available to execute the application.

In step 510 a MAC can determine PAR factors based, for example, on results of step 506 and/or step 508. PAR factors can include metrics such as a number of operands that can be processed in parallel within a pipeline, or pipelines; parallel or concurrent utilization of memories to execute particular operators and store their respective operands/results; staging of operands/results among various memories (e.g., “stage buffers”) for execution by different operators; and/or, a number of particular compute units that can execute the model in parallel. In step 510, the MAC can query a search space to determine attributes of different operators corresponding to section and/or tiling decisions.

In step 512, the MAC can determine if mapping decisions determined in steps 506-510 are valid and/or good. A mapping alternative can be a “valid” alternative if, for example, that alternative can “fit” in available CGRP hardware (e.g., operands/results of operators can be stored in one or more particular memories). A mapping alternative can be “good” if that alternative can achieve one or more mapping optimization goals, such as minimizing usage of particular CGRS memories (e.g., memories of CGRPs), or types of CGRP memories, minimizing a number of memory transfers and/or transfer latencies, minimizing computational latencies of an operator and/or pipeline of operators, and/or maximizing utilization of processors and/or memories of CGRP hardware.

If, in step 512, the MAC determines that mapping decisions resulting from one or more of steps 506-510 are not valid, not good, or a combination thereof, the MAC can repeat steps among steps 506-510 to determine additional or replacement mapping decisions. Alternatively, if the MAC determines, in step 512, that mapping decisions determined in one or more of steps 506-510 are valid, good, or a combination thereof, in step 514 the MAC outputs mapping decisions (e.g., CGR hardware resource allocations, operand/results tiling decisions, and/or PAR factors) from among the mapping decisions determined in steps 506-510.
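The evaluation of step 512 can be sketched, for example, as a check applied to each candidate mapping decision in turn. In the following sketch a candidate is treated as “valid” if its memory requirement fits a capacity and as “good” if its latency meets a target; the dictionary keys, the function `select_mapping`, and the numeric values are hypothetical, and an implementation can apply other validity and optimization criteria.

```python
def select_mapping(candidates, memory_capacity, latency_target):
    """Return the first candidate that is both valid (fits in memory) and
    good (meets the latency target), or None if no candidate qualifies."""
    for candidate in candidates:
        valid = candidate["memory"] <= memory_capacity
        good = candidate["latency"] <= latency_target
        if valid and good:
            return candidate
    # No candidate qualifies: a caller could generate further candidates,
    # analogous to repeating steps among steps 506-510.
    return None


candidates = [
    {"name": "no_tiling", "memory": 64, "latency": 10},
    {"name": "two_tiles", "memory": 32, "latency": 12},
    {"name": "four_tiles", "memory": 16, "latency": 20},
]
chosen = select_mapping(candidates, memory_capacity=40, latency_target=15)
```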

In step 514 the MAC can select particular mapping decisions and output these as mapping decisions for execution of the model on CGR hardware. Alternatively, or additionally, the MAC can output all, or a subset, of mapping decisions as potential mapping decisions, and another component of the compiler, or of a CGRS for executing the application, can select particular mapping decisions as mapping decisions to configure CGR hardware and execute the application. In step 514 the MAC can output the mapping decisions to a mapping decision space (e.g., a data structure comprising mapping decisions), and/or to a search space. In step 514 the MAC can output the mapping decisions, for example to include in an IR of mapping decisions to execute the application, and/or an aux graph of the application.

A particular function of a CGRS compiler (e.g., of a MAC and/or mapper) is to allocate CGR hardware to execute an application, and/or manage application execution (e.g., schedule particular CGR hardware to execute particular operators of an application) on CGR hardware. A CGRS compiler can allocate CGR hardware and/or manage application execution on CGR hardware based on particular mapping decisions (e.g., tiling, section cuts, and PAR factors) and CGR hardware characteristics. A CGRS compiler can allocate CGR hardware and/or manage application execution on CGR hardware to achieve particular optimization objectives.

In describing features of the disclosure, the term “application” refers interchangeably to an application overall and a portion of an application, such as a section cut of a graph of the application, unless otherwise stated to apply to an application overall or a particular portion of an application. Similarly, the term “graph” refers interchangeably to a graph of a complete application and a portion of such a graph, such as a section cut of a graph of the application. Thus, “executing an application” refers interchangeably to executing an application overall as well as to executing a particular portion of an application, such as a particular section cut of a graph. Similarly, “mapping a graph” can refer equally to mapping a complete graph of an application or mapping a particular portion of a graph, such as a particular section of a graph.

Optimization objectives can include throughput objectives, memory optimization objectives and/or processing optimization objectives. In implementations, throughput objectives can include, or can correspond to, application execution throughput, such as overall training throughput in training a machine learning application, or throughput of a section of a graph, for example.

Memory optimization objectives can include, or correspond to, objectives relating to use of memories to execute the application, such as fitting (storing) all elements of operand and/or results matrices within particular memories, such as CGR memories (e.g., PMUs) and/or other (e.g., host or runtime) memories; reducing memory to memory transfer sizes and/or latencies; minimizing or, alternatively, maximizing utilization of a total number, or particular type, of memories; and/or minimizing a number of stage buffers in a pipeline.

CGR hardware can comprise, or otherwise have access to, a mix of “on-chip” and “off-chip” memories. On-chip memories (e.g., in the examples of Grohoski and Kumar, PMUs, SRAMs, scratch pad, stage buffers, and/or caches) can be integrated in a CGRP, and/or an IC, to be closely coupled to one or more CGRPs. Off-chip memories can be memories (e.g., DRAM memories) of, or accessible to, CGRPs that are implemented on an IC different from that of a processor, or compute unit of a CGRP executing an operator. Off-chip memories can be larger (have greater data capacity) than on-chip memories, but can be accessible to CGRPs at generally lower bandwidths or clock frequencies in comparison to on-chip memories.

CGR hardware can comprise a mix of on-chip and off-chip memories, such that a particular allocation of these memories to operators in a pipeline, and/or CGRPs processing operands/results data in particular memories, can dramatically affect throughput and/or computational latency of model execution on the CGR hardware. Thus, memory optimization objectives can include balancing utilization of on-chip versus off-chip memories in CGR hardware to execute an application.
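To illustrate how placement of data in on-chip versus off-chip memories can affect transfer latency, the following sketch estimates latency as bytes divided by bandwidth for each tensor. The bandwidth figures, tensor sizes, and the function `transfer_latency` are illustrative assumptions only and do not describe any particular CGR hardware.

```python
def transfer_latency(tensor_bytes, placement, bandwidth):
    """Estimate total transfer latency as the sum over tensors of
    size / bandwidth of the memory kind in which each tensor is placed."""
    return sum(size / bandwidth[placement[name]]
               for name, size in tensor_bytes.items())


tensor_bytes = {"weights": 4000, "activations": 1000}
# Hypothetical bandwidths in bytes per time unit.
bandwidth = {"on_chip": 100.0, "off_chip": 10.0}

all_off_chip = transfer_latency(
    tensor_bytes,
    {"weights": "off_chip", "activations": "off_chip"},
    bandwidth)
mixed = transfer_latency(
    tensor_bytes,
    {"weights": "off_chip", "activations": "on_chip"},
    bandwidth)
```

Under these assumed values, moving only the smaller tensor on-chip reduces the estimate from 500 to 410 time units, which suggests why a mapper can weigh on-chip capacity against the latency saved per tensor.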

Processing optimization objectives can include, or correspond to, objectives relating to CGRPs, or components of CGRPs, to execute an application, such as maximizing the number of stages and/or operators in a pipeline; maximizing the number of parallel operations of a graph (e.g., parallel CGRPs, computations, and/or data transfers); maximizing or, alternatively, minimizing, utilization of certain CGRPs, or a number of CGRPs, executing an application; minimizing computational latencies for operators in a graph, and/or particular CGRPs executing the operators; and/or balancing pipeline stages (e.g., balancing tiling operand/results, and mapping operators in a pipeline to CGRPs, such that all stages of the pipeline execute with no, or minimal, interstage delays).
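As a simplified illustration of balancing pipeline stages, the following sketch assumes a steady-state pipeline in which every stage processes one item per interval and the interval is set by the slowest stage. The functions and latency values are hypothetical and are provided only to show why unbalanced stages introduce interstage delays.

```python
def pipeline_interval(stage_latencies):
    """Steady-state time between successive outputs: the slowest stage."""
    return max(stage_latencies)


def idle_time_per_item(stage_latencies):
    """Total time, summed over stages, spent waiting on the slowest stage
    for each item processed."""
    interval = pipeline_interval(stage_latencies)
    return sum(interval - latency for latency in stage_latencies)


balanced = [4, 4, 4]      # same total work, evenly distributed
unbalanced = [2, 4, 6]    # same total work, unevenly distributed
```

Both pipelines perform 12 units of work per item, but the balanced pipeline produces an output every 4 units with no idle time, while the unbalanced pipeline produces an output every 6 units.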

Optimization objectives can include user-defined objectives (e.g., throughput, memory, and/or processing objectives determined by a programmer/developer, or user, of an application), and/or system-defined objectives. User-defined and/or system-defined objectives can be based on CGRS and/or CGR hardware design. User-defined and/or system-defined objectives can be included, for example, in application programming statements and/or constructs (e.g., data structures), and/or compiler input files.

In implementations, optimization objectives can correspond to, and/or be based upon, particular optimization metrics. Optimization metrics can include, for example, a data transfer latency; a computational latency; a total execution latency; a computational and/or data transfer throughput; a number of parallel computations and/or data transfers; a memory utilization; and/or a processor (e.g., CGRP) utilization.

Mixed Integer Modelling

It can be a general objective of a CGRS compiler to determine a performant mapping of an application to CGR hardware that allocates particular amounts and/or types of CGR hardware resources to operators of an application that obtains a high application execution throughput. However, CGR hardware can comprise tens, or even hundreds, of thousands of processing and/or memory units (e.g., CGRPs, hardware tiles, PCUs and/or PMUs). Efficiently mapping application operators and data, such as tensor inputs/outputs, to such an enormous number of CGR hardware resources, and to achieve particular execution objectives, can present a highly complex computational problem. Thus, it is highly desirable and advantageous to implement methods that can simplify the computational complexity of mapping an application to CGR hardware for executing the application.

A CGRS compiler can determine and/or evaluate mapping decisions using heuristic-based approaches. Heuristics to evaluate mapping decisions can include selecting empirical values of graph computations, and/or CGR hardware operational characteristics (e.g., bandwidths, throughput, latencies, and/or numbers, of hardware elements). An application developer and/or components of a compiler can apply such heuristic values determined based on past application execution. However, selecting particular heuristics can be complex in light of the scale of CGR hardware resources as well as the scale of an application comprising potentially billions of tensors and/or operators, particularly if performed as a manual tuning process (e.g., by an application developer or CGRS engineer). Additionally, such heuristics can be highly macroscopic, or directed to a narrow set of execution performance parameters (e.g., memory utilization versus processor utilization), and can overlook finer mapping optimizations that can improve application execution over mapping decisions based on heuristics.

Alternatively, implementations can automate and improve determining optimal mapping decisions, and/or selecting optimal mapping decisions from among a potentially large number of possible mapping alternatives, using a mixed-integer (MI) mathematical model (“MI model”) of a graph and CGR hardware to execute the graph. An MI model can comprise MI linear equations that can include mixed continuous and integer variables. A CGRS compiler can generate an MI model to mathematically represent mapping decisions, or effects of mapping decisions, as well as CGR hardware characteristics and/or execution objectives. Using an MI model, a CGRS compiler can determine optimal mapping alternatives as a high-dimensional mathematical optimization problem. Using an MI model to solve a set of linear equations can be computationally very efficient (e.g., in terms of compilation compute time and resources) to determine optimized mapping decisions.

An MI model can comprise an “objective function” representing application throughput (which can be directed to throughput of particular sections of an application). An objective function can include decision variables that represent CGR hardware resources, and constraints that represent invariants of CGR hardware resource allocations. An MI model can include linear equations that incorporate the decision variables, constraints, and objective function(s) based on mapping decisions and CGR hardware design. A CGRS compiler (e.g., a MAC of a CGRS compiler) can generate an MI Model and input the MI model to an “MI Solver” designed to solve for the decision variables using mixed integer linear equations. An MI solver can comprise a system and/or one or more programs for solving linear equations. An MI solver can be, or can be a component of, for example, a commercially available MI Solver program, such as Gurobi® or Google® Optimizer Tools, or any software program, processor, and/or hardware circuit designed to solve such linear equations. An MI solver can execute on a processor of a computing system that can execute a CGRS compiler, or can execute on a processor of a computing system communicatively coupled to a computing system for executing a CGRS compiler (e.g., a computing system commercially offering MI computational program services).

In implementations, decision variables can be independent variables that represent hardware allocation decisions. Equations of an MI model can solve for values of decision variables, and solutions to the decision variables can be outputs of an MI solver that solves equations of the MI model. Constraints can be linear equations that describe invariants that must hold true for a mapping decision to be valid. If an allocation violates a constraint equation, then the allocation does not correctly execute the application model (i.e., does not correctly execute the computation graph) or the CGR hardware cannot execute the graph based on those mapping decisions (e.g., execution of the nodes in a section or stage exceeds the capabilities, or size, of the available CGR hardware). Objective functions can describe metrics for which mapping decisions should optimize.
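For reference, a mixed integer linear model of this kind can be written in a generic textbook form, shown below. The symbols are assumptions introduced for illustration and are not the specific equations of the disclosure: x denotes integer decision variables (e.g., counts of compute or memory units allocated to a node), y denotes continuous decision variables (e.g., latencies), c and d are objective coefficients, and A, B, and b encode the constraints.

```latex
\begin{aligned}
\text{maximize (or minimize)} \quad & c^{\top} x + d^{\top} y \\
\text{subject to} \quad & A x + B y \le b, \\
& x \in \mathbb{Z}_{\ge 0}^{n}, \quad y \in \mathbb{R}_{\ge 0}^{m}
\end{aligned}
```

In this form each row of the constraint system corresponds to one invariant, such as a limit on available hardware, and an allocation is valid only if it satisfies every row.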

A CGRS compiler can generate an MI model comprising decision variables, constraint equations, and an objective function and can input the MI model decision variables, constraint equations, and objective function to an MI solver. The MI solver can solve the equations to find values of the decision variables that give highest value, for a function intended to maximize a metric (e.g., utilization), or lowest value, for a function intended to minimize a metric (e.g., latency), of the objective function(s).
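To illustrate the roles of decision variables, constraints, and an objective function, the following sketch solves a very small integer linear model by exhaustive enumeration. The enumeration stands in for an MI solver only because the example is tiny; it is not how an MI solver operates. The function `solve_by_enumeration`, the per-node coefficients, and the resource limits are hypothetical values chosen for illustration.

```python
from itertools import product


def solve_by_enumeration(gain, pmu_cost, max_par, total_pcus, total_pmus):
    """Maximize sum(gain[i] * x[i]) over integer x[i] in 1..max_par[i],
    subject to sum(x) <= total_pcus and
    sum(pmu_cost[i] * x[i]) <= total_pmus.

    x[i] is the decision variable: the number of compute units (the
    parallelization factor) allocated to node i.
    Returns (best_x, best_value), or (None, None) if nothing is feasible.
    """
    best_x, best_value = None, None
    for x in product(*(range(1, p + 1) for p in max_par)):
        # Constraint: total compute units allocated cannot exceed supply.
        if sum(x) > total_pcus:
            continue
        # Constraint: memory units implied by the allocation must fit.
        if sum(c * xi for c, xi in zip(pmu_cost, x)) > total_pmus:
            continue
        # Linear objective function evaluated for this allocation.
        value = sum(g * xi for g, xi in zip(gain, x))
        if best_value is None or value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Two nodes; the coefficients below are illustrative only.
solution = solve_by_enumeration(
    gain=[3, 2], pmu_cost=[2, 1], max_par=[4, 4],
    total_pcus=6, total_pmus=8)
```

In this example the first node has the larger per-unit gain, but the memory constraint limits its allocation, so the highest objective value is obtained by allocating 2 units to the first node and 4 units to the second.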

A CGRS compiler (or, components of a CGRS compiler) can determine elements of an application, such as operator nodes and/or data inputs/outputs (e.g., tensors) of an application, by analyzing any of a variety of representations of the application, including high level programming statements (e.g., Python), an HLR, and/or LLIR. However, it can be particularly advantageous (e.g., less computationally complex) for a MAC to determine elements of an application using a computation graph of the application. Thus, for purposes of illustrating methods of the disclosure, but not intended to limit implementations, the disclosure uses the example of a computation graph as representing an application for performing methods of the disclosure.

FIG. 6 illustrates example method 600 for a CGRS compiler to generate an MI model of an application graph and CGR hardware and use the MI model to determine application mapping decisions and CGR hardware allocations that can achieve a particular application execution objective, such as increasing or maximizing execution throughput, or decreasing or minimizing execution latency. A MAC component of a CGRS compiler can perform the method and, for purposes of illustrating the disclosure, the methods are described as performed by a MAC component of a CGRS compiler (hereinafter, with reference to method 600, “the MAC” and “the compiler”, respectively). However, this is not intended to limit implementations and one of ordinary skill in the art will understand that the methods, and/or operations of the methods, can be performed by a component of a compiler, or of a computing system, in addition or alternative to a MAC.

In step 602 of method 600 the MAC receives (or, alternatively, retrieves, such as from a file system) a graph (or, a portion of a graph) of a dataflow application (e.g., a neural network) and a CGR hardware specification or description. The MAC can receive the graph as an input to method 600 or can itself generate the graph in step 602. The hardware specification can describe CGR hardware configurations (e.g., number and types of processors and/or memories and their interconnection topologies) and/or operating characteristics (e.g., bandwidths, sizes, latencies, and/or throughputs) of CGR hardware allocated in the mapping decisions.

In step 602 the MAC analyzes the graph to determine operator nodes of the application to map to particular CGR hardware. The MAC can determine the nodes based on, or as corresponding to, mapping decisions for mapping the graph to CGR hardware (e.g., mapping decisions determined by a mapper component of the compiler). In step 602 the MAC can determine forward sections and/or nodes, backward sections and/or nodes, and/or recompute nodes of the graph to perform method 600. For example, the MAC can determine one or more forward sections of the graph, and nodes included in the forward sections, and can perform method 600 to determine a performant mapping of the nodes of the forward sections to CGR hardware. The MAC can additionally, or alternatively, determine one or more backward sections of the graph, and nodes included in the backward sections, and can perform method 600 to determine a performant mapping of the nodes of the backward sections to CGR hardware. The MAC can determine, in step 602, that particular nodes of the graph should recompute their outputs as inputs to other nodes of the graph, and can perform method 600 to determine a performant mapping of the recompute nodes/sections to CGR hardware. The MAC can perform method 600 for forward, backward, and/or recompute nodes/sections individually or in any combination.

In steps 604-608 the MAC generates decision variables and linear equations of an MI model associated with mapping decisions to execute the nodes determined in step 602, such as CGR hardware allocations included in the mapping decisions.

In step 604, the MAC generates one or more objective functions. Objective functions can describe an optimization metric (e.g., a latency or throughput metric) that can be a basis for determining optimized mapping decisions and/or CGR hardware allocations. Objective functions can incorporate values to maximize, and/or values to minimize, as metrics to determine mapping decisions, such as CGR hardware allocation, and to evaluate the mapping decisions. In implementations an objective function can comprise, for example, a linear function that maximizes processor and/or memory utilization, a linear function that minimizes overall model execution latency, a function that optimizes stage utilization and/or node parallelism, and a linear function that minimizes transfer latencies among memories used in executing a node. The objective function(s) can be input to an MI Solver for the solver to compute a minimum (or, alternatively, maximum) value of the objective function(s).

In step 606 the MAC determines decision variables and decision equations to determine values of the decision variables. In implementations, decision variables can be independent variables that represent CGR hardware allocation decisions. An MI Solver can solve decision equations that include these variables to evaluate mapping decisions. Decision equations can comprise mixed integer equations and an MI solver can use the decision equations in combination with the objective function to determine a mapping that can satisfy the objective function(s).

In implementations, decision variables can include, for example, node parallelization factors; CGRP (e.g., PCU) usage associated with nodes, sections, and/or stages of sections, of the graph; and/or CGR memory (e.g., PMU or other local memories, and/or remote memories such as DRAMs) usage associated with nodes, sections, and/or stages of sections, of the graph. Decision variables can include processing and/or memory transfer latencies of nodes, sections, and/or stages of the graph.

Decision variables can include sets, such as sets of nodes in a section and/or stage of a section, and/or sets of stages within a section. The sets can correspond to forward sections and/or backward sections of the graph. Decision equations can include equations to determine such sets, or optimization metrics associated with such sets. Decision variables can include binary variables indicating, for example, whether a node is in a particular section or stage, whether a node is recomputed in a backward section, whether a node needs to load from a memory (e.g., a remote memory) an output (e.g., a tensor) of another node or section, whether a node needs to save in a memory (e.g., a remote memory) an output (e.g., a tensor) of the node, and/or whether a node is in a section different from children nodes of the node (a child node being a node included in a nested pipeline that receives an output of a parent node).

Decision equations can comprise, for example, equations to determine numbers of processors and/or amounts of memories required to execute a node by the CGR hardware. Decision equations can include equations to determine processor, memory, and/or data transfer latencies to execute a node by the CGR hardware. Decision equations can comprise equations to determine nodes included (or, to include) in a section or stage of an application or graph, to determine size and/or shapes of tiles of tensor inputs/outputs of nodes of a graph, and/or to determine parallelization factors to execute nodes of a graph.
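As a minimal illustrative sketch, and not as a limitation, the following Python shows one way decision variables and a processor usage decision equation could be represented and evaluated for a single candidate assignment. The names NodeDecision, base_pcus, and section_pcu_usage are hypothetical, as is the assumption that the processor usage of a node scales linearly with its parallelization factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDecision:
    """Hypothetical per-node decision variables (integer valued)."""
    section: int      # index of the section the node is assigned to
    par_factor: int   # parallelization factor chosen for the node


def section_pcu_usage(decisions, base_pcus, section):
    """Evaluate an assumed usage equation for one section.

    usage(section) = sum over nodes in the section of
                     base_pcus[node] * par_factor[node]
    """
    return sum(
        base_pcus[name] * d.par_factor
        for name, d in decisions.items()
        if d.section == section
    )


# Candidate values for the decision variables (illustrative only).
decisions = {
    "conv1": NodeDecision(section=0, par_factor=2),
    "relu1": NodeDecision(section=0, par_factor=1),
    "conv2": NodeDecision(section=1, par_factor=4),
}
base_pcus = {"conv1": 3, "relu1": 1, "conv2": 3}

print(section_pcu_usage(decisions, base_pcus, 0))  # PCUs used by section 0
```

In an actual MI model the solver, rather than the caller, would choose the values of the decision variables; the sketch only evaluates the equation for fixed values.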

In step 608, the MAC generates constraint variables and constraint equations that can represent constraints (execution invariants) that mapping decisions must not violate. For example, CGR hardware, or particular CGR hardware available to execute an application, can impose certain constraints on mapping decisions, such as the number of CGRPs and/or memories available to execute an application or particular nodes of a graph. The number of processors and/or memories on a chip to execute an application or particular nodes of a graph can be a constraint on allocation of CGRP and/or memory resources. These are execution invariants in that they represent execution limits that are not changeable (e.g., it is not possible to allocate more processors than are available).

Constraints can comprise linear equations that describe execution invariants (e.g., maximum usable processors or memory) that must hold true for a mapping decision to be valid. If a mapping decision violates a constraint (e.g., a number of processors allocated is greater than a number of processors in a CGRS or available for execution), the graph can be incorrectly executed in the mapping decision, or cannot be executed on the CGR hardware allocated in the decision, or CGR hardware available to execute the graph, or nodes of the graph. In step 608 the MAC can determine variables, and values of the variables, representing the constraints.

To determine constraint equations, the MAC can utilize the hardware specification received or retrieved in step 602. The hardware specification can describe, for example, CGR hardware configurations (e.g., number and types of processors and/or memories and their interconnection topologies) and/or operating characteristics (e.g., bandwidths, sizes, latencies, and/or throughputs) of CGR hardware allocated in the mapping decisions.

In implementations, a hardware specification can include aspects of the CGR hardware such as the number of processors available to execute the application model, the number and/or types of memories available to execute the application model, and processor and/or memory bandwidths. The processors and/or memories can be associated with a particular chip (e.g. an on-chip memory of a particular chip, or an off-chip memory of another chip), and constraints can include the number of processors and/or memories within a chip boundary. The MAC can generate the constraint variables and equations, in step 608, based on the hardware specification.

In step 608 the constraint equations can include equations corresponding to stage and section assignments of mapping decisions. Such equations can assign each node (and/or, verify assignment of each node) of a graph to only one stage and one section of the graph. In step 608, the MAC can analyze the graph to determine forward and/or backward nodes or sections, and/or to determine nodes that are recomputed.
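The assignment constraint described above can be sketched, under stated assumptions, as a check over binary indicator variables. In this hypothetical encoding, x[node][section] equals 1 when the node is assigned to the section and 0 otherwise, and the constraint is that the indicators of each node sum to exactly 1. The function name and the encoding are illustrative and are not taken from the disclosure; the same form of check would apply to stage assignment.

```python
def assigned_exactly_once(x):
    """Return True if every node is assigned to exactly one section.

    x maps node name -> {section index: 0 or 1}. The constraint checked is
    sum over sections of x[node][section] == 1, for every node.
    """
    for indicators in x.values():
        values = list(indicators.values())
        if any(v not in (0, 1) for v in values):
            return False  # indicator variables must be binary
        if sum(values) != 1:
            return False  # node is unassigned or assigned more than once
    return True


valid = {"conv1": {0: 1, 1: 0}, "relu1": {0: 0, 1: 1}}
doubly_assigned = {"conv1": {0: 1, 1: 1}}
print(assigned_exactly_once(valid), assigned_exactly_once(doubly_assigned))
```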

The MAC can determine bounds constraint variables and/or bounds equations to include in the constraint equations. Bounds variables and equations can correspond to limiting metrics of the CGR hardware, such as number of processors, memories, and/or compute/data transfer latencies. Bounds variables and equations can correspond, for example, to limits to the number of nodes, sections, and/or stages that can execute on the CGR hardware.

In step 608 the constraint equations can include equations corresponding to data dependencies among nodes. Certain nodes (operators) of a graph can require outputs of other nodes, and MI model constraints can include input/output data dependencies of the nodes. In step 608 the MAC can generate equations that represent such data dependencies among nodes, and which can be used to determine that stages and/or the same section do not include nodes that violate such dependencies. Determining data dependencies, in step 608, can include determining nodes that are to be recomputed, based on feedback from nodes in backward sections. In implementations an MI model can include equations that determine dependencies based on parent/child relationships among the nodes.
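A data dependency constraint of this kind can be sketched as a check over parent/child edges of the graph. The sketch below assumes, for illustration only, that sections and stages are numbered in execution order, that a child node cannot be placed in an earlier section than its parent, and that a child in the same section as its parent is placed in a later stage. The function name and these ordering conventions are hypothetical.

```python
def dependency_violations(edges, section, stage):
    """Return the (parent, child) edges that violate the assumed ordering.

    edges   : iterable of (parent, child) node-name pairs
    section : dict mapping node name -> section index
    stage   : dict mapping node name -> stage index within its section
    """
    violations = []
    for parent, child in edges:
        if section[parent] > section[child]:
            # Child would execute in an earlier section than its parent.
            violations.append((parent, child))
        elif section[parent] == section[child] and stage[parent] >= stage[child]:
            # Same section, but the child is not in a later stage.
            violations.append((parent, child))
    return violations


edges = [("conv1", "relu1"), ("relu1", "conv2")]
section = {"conv1": 0, "relu1": 0, "conv2": 1}
stage = {"conv1": 0, "relu1": 1, "conv2": 0}
print(dependency_violations(edges, section, stage))  # no violations expected
```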

In step 608 the MAC can generate hardware usage equations to compute processor and memory usage of nodes, sections, and/or stages. Hardware constraint equations can compare computed values of such usage equations to determine whether the allocation of hardware in mapping decisions violates the hardware constraints. For example, a hardware usage equation can compute the number of processors a mapping decision allocates to nodes in a section of the graph. Hardware constraint equations can compare computed values of a processor usage equation, for a given mapping decision, to constraint variables or equations, for example, to verify that a mapping decision for a section of a graph does not have a processor usage greater than the number of available processors (e.g., number of PCUs in a tile and/or on a chip, or available to allocate from among CGR hardware).

Similarly, a hardware usage equation can compute the number and/or sizes of memory allocated to nodes in a section. A memory usage equation can compute memory usage of a section of a graph as, for example, one or, alternatively, a sum of one or more, of memory usage of nodes in the section, tensor memory usage, and stage buffer memory usage. A hardware constraint equation can determine whether the computed value of the memory usage equation, for a given mapping decision, is greater than the number and/or size of available memories (e.g., number of PMUs in a tile and/or chip, or available to allocate from among CGR hardware).
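By way of a hedged example, the memory usage equation and its corresponding constraint can be sketched as follows. The function names, the use of bytes as the unit, and the simplification that available memory is the aggregate capacity of a number of equally sized PMUs are assumptions made for illustration; an actual model can account for per-memory placement.

```python
def section_memory_usage(node_mem, tensor_mem, stage_buffer_mem):
    """Sum the three usage terms named in the text, all in bytes."""
    return sum(node_mem) + sum(tensor_mem) + sum(stage_buffer_mem)


def fits_in_memory(usage_bytes, num_pmus, pmu_bytes):
    """Constraint: usage must not exceed assumed aggregate PMU capacity."""
    return usage_bytes <= num_pmus * pmu_bytes


KIB = 1024
usage = section_memory_usage(
    node_mem=[512 * KIB, 256 * KIB],   # per-node working memory
    tensor_mem=[1024 * KIB],           # tensors held in the section
    stage_buffer_mem=[128 * KIB],      # buffers between stages
)
print(usage, fits_in_memory(usage, num_pmus=4, pmu_bytes=512 * KIB))
```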

In step 608 the MAC can generate memory transfer equations to compute the size of data transfers among memories, such as data transfers between local memories of processors allocated in mapping decisions, and/or between a local memory of a processor and a remote memory, in a mapping decision. MI model latency equations can comprise equations to compute processing and/or memory transfer latencies associated with executing nodes of the graph. Latency equations can incorporate results of transfer size equations to determine, for example, node, section, and/or stage latencies to execute the application on particular CGR hardware. In step 608 the MAC can generate equations to compute processing and data transfer latencies, such as stage and section latencies to complete execution of nodes included in those stages and/or sections.
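The relationship between transfer size equations and latency equations can be illustrated with a short sketch. It assumes, hypothetically, that transfer latency is transfer size divided by bandwidth, that compute and transfer within a stage do not overlap, and that the latency of a pipelined section is bounded by its slowest stage. None of these modeling choices is stated in the disclosure; they are placeholders to make the example concrete.

```python
def transfer_latency(num_bytes, bandwidth_bytes_per_s):
    """Assumed model: latency in seconds = bytes moved / bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def stage_latency(compute_s, transfer_bytes, bandwidth_bytes_per_s):
    """Assumed model: compute and transfer are serialized within a stage."""
    return compute_s + transfer_latency(transfer_bytes, bandwidth_bytes_per_s)


def section_latency(stage_latencies):
    """Assumed model: the slowest stage bounds a pipelined section."""
    return max(stage_latencies)


stages = [
    stage_latency(0.002, 1_000_000, 1e9),
    stage_latency(0.004, 1_000_000, 1e9),
    stage_latency(0.001, 0, 1e9),
]
print(section_latency(stages))
```

A maximum is not itself a linear expression; in a mixed integer linear model it is commonly expressed with an auxiliary variable constrained to be greater than or equal to each stage latency.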

In analyzing the graph, such as in step 602, the MAC can determine backward and/or recompute nodes, and can generate equations that incorporate aspects of backward stages, sections, and/or nodes, and/or recompute nodes that can differ from aspects of forward sections or nodes in forward sections. For example, the number of nodes included in particular sections can account for backward nodes that need outputs of other forward or recompute nodes. A MAC can modify a backward section of a graph to account for backward nodes that need outputs of recompute nodes, for example, and equations generated in steps 604-608, to assign nodes to sections and stages, can include assignment of backward and/or recompute nodes to sections and stages based on the modified graph. The MI model generated in steps 604-608 can include variables and equations to compute the total number of sections and/or stages, and/or which nodes to include in each stage and section, in a mapping solution.

In step 610 the MAC outputs the MI model (e.g., the objective function, the decision variables, and the equations), generated in steps 604-608, to an MI solver and invokes the MI solver to solve the objective function and equations of the MI model. Solving the equations of the MI model can produce a “globally optimized” mapping of the graph to CGR hardware. A globally optimized mapping can comprise mapping decisions for all nodes of an input graph, and allocations of CGR hardware to all nodes of the input graph, that can achieve an optimization objective characterized by the objective function of the MI model.
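To illustrate what solving an objective function subject to constraints means in this context, the following sketch exhaustively enumerates a very small decision space. It is a stand-in for an MI solver, not an implementation of one, and is practical only for toy inputs. The objective (total latency, modeled as work divided by parallelization factor), the processor constraint, and all names are assumptions for illustration.

```python
from itertools import product


def solve_by_enumeration(work, pcus_per_copy, max_par, available_pcus):
    """Find parallelization factors minimizing an assumed total latency.

    Decision variables : integer par factor in 1..max_par for each node
    Constraint         : sum(pcus_per_copy[n] * factor[n]) <= available_pcus
    Objective          : minimize sum(work[n] / factor[n])
    Returns (latency, {node: factor}) or None if no assignment is feasible.
    """
    nodes = list(work)
    best = None
    for factors in product(range(1, max_par + 1), repeat=len(nodes)):
        used = sum(pcus_per_copy[n] * f for n, f in zip(nodes, factors))
        if used > available_pcus:
            continue  # violates the processor constraint
        latency = sum(work[n] / f for n, f in zip(nodes, factors))
        if best is None or latency < best[0]:
            best = (latency, dict(zip(nodes, factors)))
    return best


work = {"conv1": 8.0, "relu1": 2.0}
pcus_per_copy = {"conv1": 2, "relu1": 1}
print(solve_by_enumeration(work, pcus_per_copy, max_par=4, available_pcus=8))
```

Because every feasible assignment is examined, the result is globally optimal for this toy model. An MI solver reaches the same kind of result without full enumeration, and a linear MI formulation would need to linearize the work/factor term, for example by selecting among precomputed latencies with binary variables.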

In step 610 the MAC can output the MI model to a memory, and/or to a file of a file system, and the MI Solver can access the MI model in the memory and/or file. Invoking the MI Solver to solve the equations of the MI model, in step 610, can comprise a function of the MAC outputting the MI model to the MI Solver, and/or can comprise a communication from the MAC to the MI Solver (e.g., a CLI or API communication of the MI Solver). Invoking the MI Solver to solve the equations of the MI model, in step 610, can comprise submitting a job to a computing system that executes the MI Solver.

In step 612, the MAC receives an MI solution, computed by the MI solver, representing globally optimized mapping decisions for allocating CGR hardware to operations and dataflow of the input application model. The MI solution can comprise mapping decisions, such as section and/or stage boundaries within the graph, node parallelization factors, and/or recompute decisions among nodes of the graph, based on solutions to equations of the MI model and the objective function.

An MI Solver can output the MI solution to a memory, and/or to a file of a file system, and the MAC can access the MI solution in the memory and/or file. In step 612 the MAC can receive the MI solution based on a communication from the MI Solver to the MAC (e.g., a CLI or API communication of the MI Solver), such as a return result or status of a function of the MI Solver that the MAC can use, in step 610, to invoke the MI Solver to solve the equations. The MAC receiving the MI solution, in step 612, can comprise a function of the MI Solver outputting the MI solution, and/or can comprise the MAC receiving a result or status of a job of a computing system that the MAC can use in step 610 to invoke the MI Solver to solve the equations.

In step 614 the MAC generates globally optimized mapping decisions based on the MI solution. The globally optimized mapping decisions can optimize allocation of CGR hardware to execute nodes of the graph. In step 614 the MAC can, for example, output the globally optimized mapping decisions to a mapping decision space. The MAC can include the MI solution in an intermediate representation of the graph and/or mapping decisions.

In implementations, an IR of mapping decisions based on the MI solution can comprise, for example, IL statements representing the solution and/or globally optimized mapping decisions. The MAC can output the IR, for example, to an AIR compiler, Template Graph Compiler, and/or PNR component of a compiler, such as in the example of compiler 320 in FIG. 3. An IL compiler (e.g., a RAIL component such as in template library 324, and/or a PNR component of a compiler, such as PNR 325 in FIG. 3) can translate the IL statements to hardware configuration and execution operations, such as in a PEF.

In step 616 the MAC can, optionally, generate and/or output a human readable format, such as a textual and/or graphical representation of the MI solution, and/or the globally optimized mapping decisions. The MAC can output a human readable representation to a programmer of the application model, and/or to an administrator of a CGRS to execute an application. The programmer and/or administrator can utilize the human readable mapping decisions to determine how to execute the application model, and/or to tune (or otherwise modify) the application and/or CGR hardware configurations or allocations.

FIG. 7 illustrates an example system for performing methods of the disclosure, and/or operations thereof, such as the example of method 600 of FIG. 6. In FIG. 7, system 700 is shown comprising compiler 710, APP 702, graph 714, hardware specification HW 712, MI Solver 704, processor 708A and processor 708B, and memory 740.

In implementations, APP 702 can be a dataflow application and graph 714 can be a computation graph representing APP 702. HW 712 can comprise a description of the CGR hardware, and can specify resources, configurations, structure, and/or capabilities of the CGR hardware for executing APP 702. Compiler 710 can be a CGRS compiler capable of compiling APP 702 to CGR hardware of a CGRS and, in particular, compiling APP 702 and mapping APP 702 to CGR hardware using an MI model. Compiler 710 (or, components of compiler 710) can generate graph 714 based on APP 702 or, alternatively, compiler 710 can receive, or retrieve, graph 714 as an input to a method such as method 600 for determining a globally optimized mapping of APP 702 for execution on CGR hardware described in HW 712.

FIG. 7 illustrates compiler 710 including MAC 720, which can be a MAC such as illustrated by the examples of FIGS. 4A and 4B. Compiler 710 and/or MAC 720 can perform methods, or operations of methods, such as method 500 of FIG. 5 and/or method 600 of FIG. 6. In implementations, compiler 710 can receive APP 702, graph 714, and/or HW 712 via an API of the compiler and/or of an SDK associated with compiler 710 or a CGRS. Compiler 710 can retrieve APP 702, graph 714, and/or HW 712 from a file of a file system.

Compiler 710 and/or MAC 720 can generate an MI model such as described in reference to method 600 of FIG. 6. As shown in FIG. 7, MAC 720 includes MI model 730, which in turn includes decisions 732, constraints 734, and objectives 736. Decisions 732 can include variables and equations such as decision variables and equations generated by MAC 720 performing step 606 of method 600; constraints 734 can include variables and equations such as constraint variables and equations generated by MAC 720 performing step 608 of method 600; and objectives 736 can comprise objective functions such as objective functions generated by MAC 720 performing step 604 of method 600. MAC 720 can output MI model 730 to MI Solver 704 to solve equations and objective functions included in MI model 730 to determine mapping solutions that achieve optimization objectives (e.g., minimal latency, maximum utilization, or a combination thereof).

MAC 720 can perform method 600, and/or operations of method 600, for example, to generate MI model 730, communicate MI model 730 to MI Solver 704, and receive the results of MI Solver 704 as MI Solution 738. Based on MI model 730, MI Solution 738 can comprise mapping solutions that can meet the objectives of objectives 736 for executing APP 702 on CGR hardware described by HW 712. MAC 720 is shown, in FIG. 7, coupled to MI Solver 704 by interface 706A. In implementations, MAC 720 can use interface 706A to submit MI model 730 to MI Solver 704 for processing, and/or to instruct MI Solver 704 to process MI model 730. MAC 720 can use interface 706A to receive MI Solution 738 from MI Solver 704.

MI Solver 704 can be a program capable of processing an MI model comprising variables and equations such as illustrated in the examples of Appendix 1 and the preceding examples of the disclosure. For example, MI Solver 704 can be a commercially available program capable of computing solutions to equations of an MI model generated by compiler 710, such as previously referenced Gurobi® or Google® Optimizer Tools. In implementations, interface 706A can be, for example, an API that MAC 720 can use to submit MI model 730 to MI Solver 704 for processing, and/or to receive MI Solution 738 from MI Solver 704.

FIG. 7 shows compiler 710 coupled to processor 708A and MI Solver 704 coupled to processor 708B. Compiler 710 can comprise programs (e.g., programs of MAC 720) that can execute on processor 708A (e.g., to generate MI model 730) and MI Solver 704 can comprise programs that can execute on processor 708B (e.g., to solve equations of MI model 730). Processors 708A and 708B can comprise any kind of computing processor capable of executing programs of compiler 710 and/or MI Solver 704. Processor 708A and/or 708B can comprise one or more processors of a host computing system included in, or coupled to, a CGRS. Processor 708B can comprise a processor of a computing system that can host a commercial MI Solver. Processors 708A and 708B can comprise the same processor, or set of processors, or can comprise different processors.

FIG. 7 further shows MI model 730 and MI Solution 738 stored in memory 740. In implementations memory 740 can comprise a main memory (e.g., a main memory of a host computer, not shown in FIG. 7), and/or a memory or storage medium (e.g., a solid state disk, or hard disk) of a storage system or subsystem included in, or communicatively coupled to, system 700. FIG. 7 shows memory 740 coupled to compiler 710 (and, as a component of compiler 710, additionally or alternatively MAC 720) via interface 706B, and memory 740 coupled to MI Solver 704 via interface 706C.

FIG. 7 illustrates memory 740 as a single memory; however, this is to simplify the example of system 700 and not intended to limit embodiments. It would be appreciated by one of ordinary skill in the art that, in implementations, memory 740 can comprise multiple memories and/or storage elements, and components and/or data of MI model 730 and/or MI Solution 738 can be in different memories/storage elements in any arbitrary combination suitable for compiler 710, MAC 720, and/or MI Solver 704 to access components and/or data of MI model 730 and/or MI Solution 738.

Compiler 710, MAC 720, and/or MI Solver 704 can access (read from and/or write to) MI model 730 and/or MI Solution 738 in memory 740 via interface 706A, 706B, and/or 706C (collectively, “interfaces 706”). Interfaces among interfaces 706 can comprise, for example, an API, a command line interface, or a communications interface. Interfaces among interfaces 706 can comprise an interface for accessing MI model 730 and/or MI Solution 738 in a storage subsystem, or via a network (e.g., the Internet). Interfaces among interfaces 706 can comprise software and/or firmware programs, hardware circuits, I/O buses and/or links, or any communications interface suitable for compiler 710, MAC 720, and MI Solver 704 to interoperate and/or to access MI model 730 and/or MI Solution 738 in memory 740.

Heuristic Allocation

As previously described, a CGRS compiler can partition (tile) tensor data so that tensors can be most efficiently processed by CGR hardware. Determining optimal tile sizes and shapes of input tensor data can be based on characteristics of the underlying CGR hardware resources, and can involve the compiler allocating particular CGR hardware resources to process the tiles. A CGRS compiler can organize operators of an application into sections based on particular tiling decisions of operand and result tensors of the operators.

A CGRS can comprise pools of hardware resources (e.g., sets of processors and/or memories) and mapping (tiling, for example) very large amounts of tensor data (e.g., millions or billions of tensor data elements), and associated CVNN operators, to large and complex CGRS resource pools can present a highly computationally complex problem. The reconfigurable nature of hardware in CGRS resource pools can particularly increase that inherent computational complexity. Thus, it is highly desirable and advantageous to implement efficient methods for a CGRS compiler (or components of a CGRS compiler, such as a MAC) to partition tensor data into suitable tiles to most efficiently allocate CGR hardware to process the tensors.

As previously discussed, mapping decisions to tile tensor data and/or allocate CGR hardware to process the tensor data can be based on, or correspond to, particular application execution objectives, such as maximizing CGR hardware utilization, maximizing parallelism among application model operators, and/or minimizing execution latencies. In determining tiling decisions and CGR hardware allocations to process the tiles, a CGRS compiler must consider constraints imposed by the tensor data and CGR hardware.

For example, tiling tensor data and mapping tiles to CGR hardware resources can be dependent on constraints such as how many processing units (e.g., PCUs) are available to execute CVNN operators; how much memory (e.g., memory local to processing units, such as caches or scratchpad memories, and/or memories such as higher level caches or large DRAMs) is available to store tile data and results of processing tile data; how much computational latency, and/or how much data transfer latency, processing each node or section of a graph contributes; data dependencies among nodes and/or sections of a graph; and what sections (e.g., tensor data tiles and/or operators) of the graph can execute in parallel.

A CGRS compiler can analyze an application graph to determine tiling decisions, determine section boundaries among operators of an application, and allocate CGR hardware to execute the application (or, sections of an application) based on application execution objectives. In implementations, a “heuristic (HR) mapper” can perform a heuristic-based method to tile input tensor data, determine section boundaries based on the tiling decisions, and determine (or, propose) an optimal mapping of the tiles to CGR hardware. In particular, an HR mapper can optimize tiling and graph boundaries jointly to achieve high execution throughput on CGR hardware.

A CGRS compiler, or a MAC of a CGRS compiler, can include such an HR mapper component. Accordingly, to illustrate the disclosure, the disclosure describes an HR mapper component of a CGRS compiler (e.g., an HR mapper component of a MAC) performing methods and operations of the disclosure. However, this is not intended to limit implementations, and it will be appreciated by one of ordinary skill in the art that methods and operations of the disclosure can be implemented in alternative structures.

FIGS. 8A and 8B illustrate example methods for an HR mapper to determine an optimal tiling of tensor data and CGR hardware allocations for application operators based on application and CGRS heuristics. In FIG. 8A, example method 800 illustrates an overall method for determining tiling and section decisions based on heuristics. In FIG. 8B, example method 830 illustrates a method to determine section boundaries based on tiling decisions, such as tiling decisions determined in performing operations of method 800.

Implementations can utilize a computation graph corresponding to an application and, for purpose of illustrating the methods, FIG. 9A illustrates an example graph that an HR mapper can analyze in performing the methods. Graph 900 is shown in FIG. 9A comprising forward nodes 910 and backward nodes 914, which can be operators of, for example, a CVNN. Forward nodes 910 is shown comprising a successive set of forward operator nodes of the application: CONV2D 902A, RELU 902B, CONV2D 902C, RELU 902D, and MAXPOOL 902E (collectively, “nodes 902”). Backward nodes 914 is shown comprising a successive set of backward operator nodes of the application: CONV2D 912A, RELU 912B, CONV2D 912C, RELU 912D, and MAXPOOL 912E (collectively, “nodes 912”).

CONV2D 902A, CONV2D 902C, CONV2D 912A, and CONV2D 912C can be 2D convolutional operators. CONV2D 902A and CONV2D 902C are shown in FIG. 9A taking respective WEIGHT 904A and WEIGHT 904B as one of two inputs to each operator, and WEIGHT 904A and WEIGHT 904B are also shown as inputs to respective backward 2D convolutional operators CONV2D 912A and CONV2D 912C. WEIGHT 904A and WEIGHT 904B can each be, for example, a tile of a kernel tensor. RELU 902B, RELU 902D, RELU 912B, and RELU 912D can be rectified linear unit operators and receive output tensors of respective operators CONV2D 902A, CONV2D 902C, MAXPOOL 912E, and CONV2D 912C. MAXPOOL 902E and MAXPOOL 912E can be max pooling operators. MAXPOOL 902E is shown receiving the output tensor of RELU 902D as an input, and MAXPOOL 912E is shown receiving input 918.

Forward nodes 910 receives input 906 to CONV2D 902A and outputs, from MAXPOOL 902E, output 908. Backward nodes 914 receives input 918 to MAXPOOL 912E and outputs, from CONV2D 912A, output 916. Inputs 906 and 918 and outputs 908 and 916 can comprise tiles of tensors. Input 906 can be a tile of, for example, an input image tensor to forward nodes 910, or can be output of another operator of the application not included in graph 900. Output 908 can be a tile of a tensor computed by MAXPOOL 902E. Input 918 can be a tile of, for example, an output tensor of a forward or backward node not shown in graph 900, or can be an output of MAXPOOL 912E to input 918. Output 916 can be a tile of a tensor computed by CONV2D 912A.

For purpose of illustrating the disclosure, methods 800 and 830 of respective FIGS. 8A and 8B are described as applied to a graph corresponding to a CVNN executed by a CGRS, and input/output data of operator nodes in a graph are described as tensors. Discussion of the methods can use the example of graph 900 in FIG. 9A to illustrate results of operations of the methods. However, this is not intended to limit implementations. One of ordinary skill in the art will appreciate that the methods can be applied to partitioning large sets of data, in addition or alternative to tensors, for processing and allocating processing resources in applications other than CVNNs and computing systems other than a CGRS.

It will be also appreciated by one of ordinary skill in the art that operations of the methods can be equally applied to a graph as a whole or, alternatively, to a subgraph comprising a subset of a larger graph. Thus, in describing example methods 800 and 830, references to “the graph” can apply equally to a graph as a whole, and a subgraph of a larger graph.

Turning then to FIG. 8A, in step 802 of method 800, the HR mapper receives (e.g., as an input of an API or CLI), or otherwise accesses (e.g., from a file system), a computation graph (hereinafter, regarding methods of FIGS. 8A and 8B, “the graph”) representing a CVNN. Alternatively, the HR mapper can receive the CVNN and generate a corresponding application graph, and/or can modify an application graph received in step 802. In performing method 800, or operations of method 800, an HR mapper can generate an “auxiliary graph” to determine and/or represent mapping decisions.

In step 804, the HR mapper analyzes the graph to determine dimensions of tensor inputs to each operator in the graph as a basis to determine longest possible tiling chains. In implementations a tiling chain can comprise a series of connected operators in a graph that successively process an input tile. Successive operators in a graph that share a common number of tiles (“num_tiles”) of their input tensors can form a tiling chain. A longest tiling chain corresponds to the maximum number of operations in a series of pipelined operations. For example, in a graph comprising 3 convolution operators, Conv1->Conv2->Conv3, Conv1 can have a num_tiles equal to 4 (for example), Conv2 can have a num_tiles equal to 4 (for example), and Conv3 can have a num_tiles equal to 8 (for example). In this example, Conv3 cannot be in the same chain as Conv1 and Conv2 because Conv3 has a num_tiles different from Conv1 and Conv2. In this example Conv1->Conv2 forms one longest tiling chain and Conv3, alone, forms a second longest tiling chain: Conv3 cannot extend the Conv1->Conv2 chain and, if there is no successor operator that can be included with it, Conv3 by itself is the longest possible chain that includes Conv3.
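The chain-forming rule in the Conv1->Conv2->Conv3 example can be sketched as a single pass over the operators. The sketch assumes, for simplicity, that the operators form a linear sequence given in pipeline order and that num_tiles is already known for each operator; the function name is hypothetical, and a general graph with branches would require a traversal rather than a list scan.

```python
def longest_tiling_chains(ops):
    """Group successive operators that share the same num_tiles.

    ops is a list of (operator name, num_tiles) pairs in pipeline order.
    Returns a list of chains, each chain being a list of operator names.
    """
    chains = []  # each entry: (list of names, num_tiles shared by the chain)
    for name, num_tiles in ops:
        if chains and chains[-1][1] == num_tiles:
            chains[-1][0].append(name)          # extend the current chain
        else:
            chains.append(([name], num_tiles))  # start a new chain
    return [names for names, _ in chains]


print(longest_tiling_chains([("Conv1", 4), ("Conv2", 4), ("Conv3", 8)]))
# [['Conv1', 'Conv2'], ['Conv3']]
```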

In step 804, the HR mapper can perform tensor partitioning to determine a num_tiles for input tensors based on the input tensor dimensions. The resulting num_tiles can represent a number of tiles into which to partition an input tensor, and can be an extra dimension added to an input tensor, which gives the tensor a shape that includes the added, num_tiles, dimension. A partitioned tensor can, then, have a shape that includes the num_tiles as an additional dimension, compared to its corresponding input tensor.

FIG. 9B illustrates an example of generating a partitioned tensor, using the example of graph 900 in FIG. 9A. In FIG. 9B, graph 920 illustrates partitioning input tensors to operators CONV2D 902A and RELU 902B. FIG. 9B shows input 906 to CONV2D 902A comprising an input tensor having dimensions 1×3×256×256, which for an input image tensor of a 2D convolutional operator can correspond to a number of data batches (1), a number of input channels (3), and a height and width dimension for each of the 3 channels (256×256). WEIGHT 904A is shown having dimensions 8×3×3×3, which for a kernel tensor of a 2D convolutional operator can correspond to a number of output channels (8), a number of input channels (3), a kernel height (3), and a kernel width (3). The output of CONV2D 902A, given these inputs, is a tensor having dimensions 1×8×256×256 (8 output channels), which is then input to RELU 902B and the result output as another tensor having dimensions 1×8×256×256.

Graph 924, in FIG. 9B, illustrates the results of the HR mapper adding a num_tiles dimension to the shapes of the tensor input as input 906 to CONV2D 902A, the output tensor of CONV2D 902A as an input tensor to RELU 902B, and the output tensor of RELU 902B, which is (in graph 900 of FIG. 9A) an input tensor to CONV2D 902C. In graph 924, PART 926 is shown as a partitioned tensor having dimensions 1×1×3×256×256, with the second dimension (having initial value “1”) in the example of PART 926 corresponding to a num_tiles dimension. The added num_tiles dimension is carried forward as an added dimension in the shapes of the output tensors of CONV2D 902A and RELU 902B.

For an operator that receives the output of RELU 902B but is not, or might not be, included in the same tiling chain as CONV2D 902A and RELU 902B (as will be seen in the example of FIG. 9D), the HR mapper can generate a de-partitioned output tensor of RELU 902B, shown in FIG. 9B as DE-PART 928, having a shape with the un-partitioned RELU 902B output tensor dimensions 1×8×256×256.
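As an illustrative sketch of the shape bookkeeping described above, partitioning can be modeled as inserting a num_tiles dimension into a tensor shape, and de-partitioning as removing it. The helper names `partition_shape` and `departition_shape`, and the choice of representing shapes as tuples, are assumptions made for illustration. The sketch tracks shapes only and does not model how tile data is split or recombined.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def partition_shape(shape: Shape, num_tiles: int = 1, axis: int = 1) -> Shape:
    """Insert a num_tiles dimension at `axis` (second dimension by default)."""
    return shape[:axis] + (num_tiles,) + shape[axis:]


def departition_shape(shape: Shape, axis: int = 1) -> Shape:
    """Remove the num_tiles dimension previously inserted at `axis`."""
    return shape[:axis] + shape[axis + 1:]


input_906 = (1, 3, 256, 256)        # batch, channels, height, width
part = partition_shape(input_906)   # num_tiles starts at the initial value 1
print(part)                         # (1, 1, 3, 256, 256)
print(departition_shape(part))      # (1, 3, 256, 256)
```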

To complete tensor partitioning, in step 804, an HR mapper can perform a shape fulfillment operation to determine a value of the num_tiles dimension of each partitioned tensor. That is, shape fulfillment can determine an actual number of tiles (an actual value of num_tiles) in each partitioned tensor. To determine values of num_tiles, the HR mapper can apply “shaping heuristics” associated with CGR hardware to execute the operators. In implementations, shaping heuristics can comprise, for example, an estimate of execution latency or floating point operations (“teraflops”, typically, in a CGRS); an estimate of computational overhead resulting from padding input tensors (which can be a function of the length of a tiling chain); memory usage, such as usage of memories local to a CGRP, to execute the operators in a chain; and/or a number of tiles that produces tensor shapes that are most compatible with characteristics of the CGR hardware (e.g., sizes of memories and/or stage buffers, or a type of CGRP to execute operators processing the partitioned tensors). The HR mapper can determine particular shaping heuristics and/or values of shaping heuristics, in step 804, based on a specification of the CGR hardware to execute the graph.

FIG. 9C illustrates an example of shape fulfillment based on the example of partitioning in FIG. 9B. FIG. 9C shows PART 926, the output tensor of CONV2D 902A, and the output tensor of RELU 902B as partitioned tensors from graph 924 in FIG. 9B, having a num_tiles dimension with an initial value of 1. Based on partition heuristics the HR mapper can generate actual numbers of tiles to execute these operators on their respective input tensors, shown in FIG. 9C as a value of 16 after the HR mapper performs shape fulfillment on the partitioned tensors of graph 924 in FIG. 9B.
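As one hypothetical illustration of shape fulfillment, the sketch below applies a single memory-usage heuristic: it selects the smallest candidate num_tiles for which one tile of the tensor fits within a per-tile memory budget. The 64 KiB budget, the 4-byte element size, and the candidate list are assumptions chosen so that the sketch yields 16 tiles for a 1×3×256×256 tensor; an actual HR mapper can weigh several shaping heuristics together and is not limited to this rule.

```python
from math import prod
from typing import Sequence


def fulfill_num_tiles(
    shape: Sequence[int],
    bytes_per_elem: int,
    tile_budget_bytes: int,
    candidates: Sequence[int],
) -> int:
    """Return the smallest candidate num_tiles whose tile fits the budget."""
    total_bytes = prod(shape) * bytes_per_elem
    for n in sorted(candidates):
        # A tile holds total_bytes / n bytes; compare without division.
        if total_bytes <= n * tile_budget_bytes:
            return n
    raise ValueError("no candidate num_tiles satisfies the memory budget")


num_tiles = fulfill_num_tiles(
    shape=(1, 3, 256, 256),          # un-partitioned input tensor shape
    bytes_per_elem=4,                # assumed 32-bit elements
    tile_budget_bytes=64 * 1024,     # hypothetical local-memory budget
    candidates=(1, 2, 4, 8, 16, 32),
)
print(num_tiles)  # 16
```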

In analyzing the graph to partition tensors, determine the tiling chains, and/or to further map the graph to CGR hardware, in step 804 the HR mapper can traverse the graph and generate an auxiliary graph comprising operator nodes and input/output tensor edges of the graph. The HR mapper can assign names to corresponding nodes and/or edges in the auxiliary graph. In step 804, an auxiliary graph can comprise the graph received in step 802 modified by operations of method 800 and/or method 830.

In step 806, the HR mapper forms tiling chains based on the num_tiles of the partitioned tensors and determines the longest chains. In step 806 the HR mapper can analyze the partitioned tensors to determine which operators can form a tiling chain. Every operator that processes an input tensor that shares the same value of the num_tiles dimension of a corresponding partitioned tensor can be included in the chain. When a partitioned tensor input to a successive operator in a graph does not share the same value of the num_tiles dimension of an output partitioned tensor of a predecessor operator, that successive operator cannot be included in the tiling chain that includes that predecessor operator. In step 806 the HR mapper forms the longest tiling chains possible based on the partitioned tensors that input to each operator in the graph.

In step 808, the HR mapper generates “section proposals” that group operators within the tiling chains into sections that can maximize an execution objective, such as overall execution throughput of the application, or minimize an execution objective, such as minimizing execution latency of the application. To generate the section proposals, the HR mapper can perform example method 830, of FIG. 8B, or a method similar to example method 830, of FIG. 8B.

In determining operators to include in sections based on convolutional partitioning, the HR mapper can trace the auxiliary graph (e.g., using the names assigned in step 804) and can add a “partition dimension” to tensors in the chain selected in step 806. A partition dimension can represent how many tiles to execute within the chain. The HR mapper can modify the auxiliary graph to include operators representing inserting a partition dimension into (“partitioning”), and removing a partition dimension from (“de-partitioning”), input/output tensors of auxiliary graph operators.

Step 808 produces an optimum section proposal, based on the partitioned tensors, that can achieve, or best achieve among a set of section proposals, an execution objective, such as maximizing execution throughput, maximizing CGR hardware utilization, or minimizing execution latency. In step 810, the HR mapper generates an IR that includes the optimum section proposals based on the tiling chains, modified input tensor shapes (that is, the partitioned tensor shapes), and number of tiles to utilize in each of the tiling chains. In step 810 the HR mapper can, for example, modify the graph to include the optimum section proposals, partitioned tensors, and number of tiles. The HR mapper can output the modifications to the graph in a memory and/or a storage medium, or directly to a component of a CGRS compiler to further compile the IR to CGRS executable statements (e.g., to compile the IR to low level execution statements and/or a PEF). A CGRS compiler can access the IR from a memory or storage medium to further compile the IR to CGRS executable statements.

In step 812, the HR mapper outputs the IR generated in step 810. The HR mapper can output the IR to a memory and/or to a file, for example. The compiler, or components of the compiler, can utilize the output IR, for example, to perform additional compilation operations of the application, such as to generate a low level representation (e.g., assembly language or PEF statements) of the graph and/or the application. The compiler, or components of the compiler, can utilize the output IR to determine additional and/or alternative mapping decisions and/or CGR hardware allocations based on the output IR. The compiler, or components of the compiler, can utilize the output IR to evaluate mapping decisions, to estimate application execution performance, for example.

Additionally, or alternatively, in step 812 the HR mapper can output the IR in a human readable form, such that an application developer can apply the results of method 800 to modify the application. The HR mapper can output the IR to, for example, a computing system included in, or coupled to, a system hosting the CGRS compiler. The HR mapper can output the IR map using any interface and/or API suitable for communicating the IR map to a recipient.

Based on the tensor partitioning results in steps 804 and 806, the HR mapper can determine section boundaries that include operators in longest tiling chains. FIG. 8B illustrates example method 830 to determine section cut proposals to execute the operators and tiling chains on CGR hardware. The section cut proposals can include, for example, compiler-assigned (e.g., MAC or HR mapper) names of tiles, tiling chains, operators, tensors, and/or CGR hardware units; tiling partitions; and/or optimization metrics (e.g., optimization metrics of individual operators, sections, and/or aggregate optimization metrics of an entire proposal). Section cut proposals can combine operators of the input graph, for example, that can maximize the number of operators within the tiling chains.

Similar to method 800 of FIG. 8A, for purposes of illustrating the method but not intended to limit implementations, the method is described as performed by the HR mapper executing method 830 of FIG. 8B in step 808 of method 800. In describing method 830, example graph 900 in FIG. 9A serves to illustrate operations of the method.

In step 832 the HR mapper creates a set of empty section cut proposals. In step 834 the HR mapper generates a first proposal (P1) comprising each operator in the graph assigned to a unique, individual section. For example, with reference to graph 900, in step 834 the HR mapper can create a first proposal comprising 10 sections, in which each of the 10 operators among forward nodes 910 and backward nodes 914, in FIG. 9A, is assigned to an individual section among the 10 sections in the first proposal.

In an alternative initial proposal, the HR mapper can create a first proposal with fewer sections, such as in cases in which the HR mapper can determine that combining particular operators within a section (e.g., a set of operators within a longest tiling chain) is likely to yield an optimum section proposal (e.g., in a case where a set of operators in a tiling chain, or an entire tiling chain, is known to “fit” in available CGR hardware, or based on input directives to the compiler or HR mapper).

In step 836 the HR mapper updates the input/output tensors of each operator in each section based on host padding rules. In CVNNs, it can be necessary to “pad” the border (row/column) elements of an input tensor to retain an original tensor size during a convolution, as well as possibly to pad filters inside of a CVNN. “Zero padding”, by adding zero-valued border elements to a tensor, can produce output tensors of the same size as input tensors.

In step 836, if input/output tensors of the tiling chain require padding, the HR mapper updates tensor shapes of the input/output tensors based on the padding rules of the tiling algorithm. Such rules can include, for example, which border elements can contain zero values, and/or how to pad the tensors to reduce, or maintain, a particular rate of decrease of spatial dimensions of the tensors.
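To illustrate why zero padding can preserve tensor size, the sketch below applies the standard output-size relation for a convolution along one spatial dimension, out = floor((in + 2·padding − kernel) / stride) + 1. The function names are hypothetical, and the sketch assumes symmetric padding, unit dilation, and an odd kernel size; it does not represent the padding rules of any particular tiling algorithm.

```python
def same_padding(kernel: int) -> int:
    """Zero padding per border that preserves size for stride 1, odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    return (kernel - 1) // 2


def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


pad = same_padding(3)
print(conv_output_size(256, 3))       # 254: the spatial size shrinks
print(conv_output_size(256, 3, pad))  # 256: zero padding retains the size
```

Without padding, each 3×3 convolution in a chain would reduce the height and width by 2, so the shapes of tensors later in a tiling chain depend on the padding applied earlier in the chain.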

In step 836 the HR mapper further computes optimization metrics for each section in the proposal and saves the proposal. Optimization metrics can include, for example, computational throughput (e.g., a number of computational operations/second), and/or computational and/or memory latencies. In step 836 the HR mapper saves proposal P1, comprising the updated tensor shapes, section cuts, and estimated section optimization metrics of each of the section cuts.

Using operators among forward nodes 910 in FIG. 9A, proposals 850 in FIG. 8B illustrate various possible proposals (in FIG. 8B, proposals 850 illustrate example proposals of only a subset of the operators in graph 900 of FIG. 9A, and are not meant to be exhaustive). In proposals 850, P1 is an example first proposal placing operators CONV2D 902A, RELU 902B, CONV2D 902C, RELU 902D, and MAXPOOL 902E into individual section proposals, respectively S0, S1, S2, S3, and S4.

In step 838, the HR mapper can recursively process previous section proposals determined in step 834 to add operators from one section of a previous proposal to another section in a new proposal. That is, in step 838 the HR mapper can select a proposal from among proposals previously determined in step 834, can select two or more successive operators or sections of the previous proposal, and try to combine operators of the selected sections.

In step 838, the HR mapper can try to combine operators in a new section proposal based on whether the operators can be performed together in the section, or must be included in separate sections. For example, operators that can be pipelined or overlapped, and/or operators having the same number of tiles, can be, potentially, included in the same section. On the other hand, an operator can be dependent on one or more other (e.g., predecessor) operators, or have different numbers of tile partitions, such that that operator cannot be included in a section that includes certain other operator(s).

Also, in step 838, the HR mapper can determine operators to include in a new section proposal based on whether the operators can all “fit” in the CGR hardware available, or configured, to execute the operators. A combined set of operators forming a section can “fit” in the CGR hardware, for example, if the operators, and/or dimensions and/or sizes of tensors, in combination, do not exceed the number of processors and/or number or size of memories of available CGR hardware.
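A minimal sketch of such a “fit” check follows, assuming each operator carries an estimated processor count and memory requirement and that the hardware is summarized by two totals. The `OpCost` and `HwSpec` types, their fields, and all numeric values are hypothetical; an actual check can account for additional resources such as stage buffers or particular processor types.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OpCost:
    name: str
    processors: int      # estimated processors needed by the operator
    memory_bytes: int    # estimated memory needed by the operator


@dataclass(frozen=True)
class HwSpec:
    processors: int      # processors available to one section
    memory_bytes: int    # memory available to one section


def fits(ops: Iterable[OpCost], hw: HwSpec) -> bool:
    """True if the combined operators do not exceed the available hardware."""
    ops = list(ops)
    return (
        sum(o.processors for o in ops) <= hw.processors
        and sum(o.memory_bytes for o in ops) <= hw.memory_bytes
    )


hw = HwSpec(processors=8, memory_bytes=1 << 20)
conv_a = OpCost("CONV2D 902A", processors=4, memory_bytes=600_000)
relu_b = OpCost("RELU 902B", processors=2, memory_bytes=200_000)
conv_c = OpCost("CONV2D 902C", processors=4, memory_bytes=600_000)

print(fits([conv_a, relu_b], hw))          # True
print(fits([conv_a, relu_b, conv_c], hw))  # False: 10 processors exceed 8
```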

To further illustrate recursive creation of alternative section proposals, with reference to proposal P1 in proposals 850, an HR mapper can create proposal P2, shown in proposals 850 as P2 section S0 comprising sections P1 S0 and P1 S1, and P2 sections S1-S3 comprising sections S2-S4 of P1. If operators of P1 S0 and S1 can be combined (based on the nature of the operators and/or input/output tensor partitions), and the combination in P2 S0 can fit on the CGR hardware, then P2 can be a valid proposal. In another example the HR mapper can attempt to create section proposal P4 that combines, for example, proposal P2 section S1 operators and proposal P2 section S2 operators into a combined section S1 of proposal P4. Assuming that operators RELU 902B and CONV2D 902C can be in the same section and, in combination, can fit on CGR hardware, proposal P4 can be a valid proposal.
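The generation of alternative proposals by combining successive sections can be sketched as follows. The sketch starts from a proposal with one operator per section and repeatedly merges one pair of adjacent sections at a time, keeping a merged proposal only if a caller-supplied predicate accepts the combined section. It uses a worklist rather than explicit recursion, and the names `merge_adjacent`, `enumerate_proposals`, and the `can_combine` predicate are hypothetical; the predicate used in the example (at most two operators per section) merely stands in for compatibility and hardware “fit” checks.

```python
from typing import Callable, List, Sequence, Set, Tuple

Section = Tuple[str, ...]
Proposal = Tuple[Section, ...]


def initial_proposal(ops: Sequence[str]) -> Proposal:
    """First proposal: every operator in its own section."""
    return tuple((op,) for op in ops)


def merge_adjacent(p: Proposal, can_combine: Callable[[Section], bool]) -> List[Proposal]:
    """All proposals formed by merging one pair of adjacent sections of p."""
    merged_proposals = []
    for i in range(len(p) - 1):
        merged = p[i] + p[i + 1]
        if can_combine(merged):
            merged_proposals.append(p[:i] + (merged,) + p[i + 2:])
    return merged_proposals


def enumerate_proposals(ops: Sequence[str], can_combine: Callable[[Section], bool]) -> Set[Proposal]:
    """Explore proposals reachable from the initial one by repeated merging."""
    first = initial_proposal(ops)
    seen = {first}
    worklist = [first]
    while worklist:
        for q in merge_adjacent(worklist.pop(), can_combine):
            if q not in seen:
                seen.add(q)
                worklist.append(q)
    return seen


ops = ["CONV2D_902A", "RELU_902B", "CONV2D_902C", "RELU_902D", "MAXPOOL_902E"]
proposals = enumerate_proposals(ops, can_combine=lambda s: len(s) <= 2)
p2 = (("CONV2D_902A", "RELU_902B"), ("CONV2D_902C",), ("RELU_902D",), ("MAXPOOL_902E",))
print(p2 in proposals)   # True
print(len(proposals))    # 8
```

Exhaustive enumeration grows quickly with the number of operators, which is one reason end criteria such as a time limit or a cap on enumerated proposals can be useful.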

In step 840, for each valid section proposal combining operators in different sections, the HR mapper updates section tensor shapes based on padding rules, estimates section optimization metrics, and saves the proposal. By processing section proposals recursively the HR mapper can generate a variety of alternative section proposals to determine an optimal set of sections to include in mapping decisions to execute a graph. Each of the alternative proposals can have proposed sections with varying optimization metrics, some proposals having better aggregate optimization metrics as compared to others.

In step 842, the HR mapper can determine to discontinue generating proposals based on meeting particular end criteria. In implementations, criteria to end forming alternative proposals can include that all possible proposals (e.g., all section options) have been enumerated; that all operators have been included in a section proposal; that a particular number, or fraction, of possible proposals have been enumerated; that proposal computation time has reached a compute time, and/or computational utilization, limit; and/or that optimization metrics of remaining section cut options are not improving.
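As a hypothetical illustration, several of these end criteria can be combined into a single predicate. The parameter names, the default limits, and the “patience” rule for detecting non-improving metrics are assumptions made for this sketch; the metric is assumed to be one for which larger values are better.

```python
from typing import Sequence


def should_stop(
    enumerated: int,
    total_possible: int,
    elapsed_s: float,
    best_history: Sequence[float],
    *,
    max_fraction: float = 1.0,
    time_limit_s: float = 60.0,
    patience: int = 3,
) -> bool:
    """Return True when any of the sketched end criteria is met.

    best_history holds the best metric found in each round so far.
    """
    # All (or a chosen fraction of) possible proposals have been enumerated.
    if enumerated >= total_possible * max_fraction:
        return True
    # The compute-time limit has been reached.
    if elapsed_s >= time_limit_s:
        return True
    # The last `patience` rounds did not beat the best earlier round.
    if len(best_history) > patience:
        recent = max(best_history[-patience:])
        earlier = max(best_history[:-patience])
        if recent <= earlier:
            return True
    return False


print(should_stop(3, 8, 1.0, [10, 11]))               # False
print(should_stop(8, 8, 1.0, [10, 11]))               # True: all enumerated
print(should_stop(3, 8, 1.0, [10, 12, 12, 11, 12]))   # True: no improvement
```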

If, in step 842, the HR mapper determines that the proposal end criteria have not been met, the HR mapper can repeat steps 838 and 840 to create new section proposals to include operators not already included in a proposal, and/or to iterate over already generated proposals to generate additional section proposals based on the sections included in the already generated proposals. If, on the other hand, the end criteria have been met, in step 844 the HR mapper evaluates the saved proposals, generated in steps 832-840, based on the optimization metrics determined in steps 836 and 840, selects an optimal proposal from among the saved proposals, and outputs the optimal proposal.

In step 844 the HR mapper compares optimization metrics associated with the saved proposals, generated in steps 834-840, and selects and outputs an optimal proposal. For example, in step 844 the HR mapper can output as the optimal proposal the proposal that results in the highest computational throughput, greatest overlap of operator executions, fits best with the CGR hardware, and/or produces the lowest computational latencies. In step 844, the HR mapper can output the optimal proposal, for example, to step 810 of method 800 in FIG. 8A. Alternatively, the HR mapper can output the optimal proposal to a storage medium, such as for later processing by another component of a CGRS (e.g., another component of a CGRS compiler), and/or a human user (e.g., an application programmer).
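A minimal sketch of such a selection follows, assuming each saved proposal carries two aggregate metrics and that the selection prefers the highest throughput, breaking ties by the lowest latency. The `ScoredProposal` type, the ordering rule, and the metric values are hypothetical; an implementation can combine or weight metrics differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredProposal:
    name: str
    throughput: float   # e.g., operations per second; higher is better
    latency: float      # e.g., seconds; lower is better


def select_optimal(proposals: Iterable[ScoredProposal]) -> ScoredProposal:
    """Pick the highest throughput, breaking ties by the lowest latency."""
    return max(proposals, key=lambda p: (p.throughput, -p.latency))


saved = [
    ScoredProposal("P1", throughput=100.0, latency=0.5),
    ScoredProposal("P2", throughput=140.0, latency=0.4),
    ScoredProposal("P4", throughput=140.0, latency=0.3),
]
best = select_optimal(saved)
print(best.name)  # P4
```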

In steps of method 800, in FIG. 8A, and method 830, in FIG. 8B, the HR mapper can utilize a search space such as previously described. The search space can operate as a lexicon to associate operators, and their input operands and output results tensors, with particular mapping attributes, such as dimensions of the tensors on which to partition the tensors.

FIG. 9D illustrates an example of performing methods 800 and 830 using the example of graph 900 in FIG. 9A. In FIG. 9D graph 950 illustrates a graph generated from graph 900 based on optimum section proposals. In graph 950, operators of forward nodes 910 of graph 900 are shown combined in sections 952A and 952B, which can be optimal sections combining operators of forward nodes 910 as determined by performing methods 800 and 830. Operators of backward nodes 914 of graph 900 are combined, in graph 950, similarly in sections 954A and 954B, which can be optimal sections combining operators of backward nodes 914 as determined by performing methods 800 and 830.

Section 952A is shown comprising CONV2D 902A and section 952B is shown comprising RELU 902B, CONV2D 902C, RELU 902D, and MAXPOOL 902E. PART 942A and PART 942B represent partitioned tensors of respective input 906 and the input to RELU 902B in section 952B. DE-PART 944A, DE-PART 944B, DE-PART 944C, DE-PART 944D, and DE-PART 944E represent de-partitioned tensors (removing the num_tiles dimensions) of outputs of CONV2D 902A, RELU 902B, RELU 902D and MAXPOOL 902E, respectively, which can be then inputs to other operators of graph 900, and/or outputs of forward nodes 910.

Section 954A is shown comprising MAXPOOL 912E, RELU 912D, and CONV2D 912C, and section 954B is shown comprising RELU 912B and CONV2D 912A. PART 946A, PART 946B, PART 946C, and PART 946D represent partitioned tensors of input tensors to CONV2D 912C, RELU 912D, and MAXPOOL 912E. PART 946D, PART 946E, and PART 946F represent partitioned tensors of respective input tensors to RELU 912D and CONV2D 912A. DE-PART 948A represents a de-partitioned tensor of the output of CONV2D 912C as an input to section 954B, and DE-PART 948B represents a de-partitioned tensor of the output of CONV2D 912B, which can be, for example, an input to other operators of graph 900.

FIG. 10 illustrates an example system for performing heuristic mapping of an application to CGR hardware such as illustrated in the examples of the disclosure. In FIG. 10, system 1000 is shown comprising APP 1002, HW Spec 1006, graph 1008, processor 1004, compiler 1020, and memory 1030. In implementations, processor 1004 can be a processor, or processors, included in a computing system (e.g., a host computing system of a CGRS or other computing system) and can be capable of executing compiler 1020 and/or APP 1002, and/or components thereof.

APP 1002 can be an application such as described in the disclosure, in particular a dataflow application, or a CVNN. APP 1002 can be executable on a CGRS. Graph 1008 can be a computation graph representing APP 1002. Compiler 1020 can generate graph 1008 based on APP 1002.

FIG. 10 further illustrates compiler 1020 comprising MAC 1022 and HR mapper 1024. In implementations HR mapper 1024 can be a component or function of compiler 1020 separate from MAC 1022 or, alternatively, can be a component or function of MAC 1022. Compiler 1020 (or, components thereof, including components not shown explicitly in FIG. 10), MAC 1022 (or, components thereof, including components not shown explicitly in FIG. 10), and/or HR mapper 1024 (or, components thereof, including components not shown explicitly in FIG. 10) can perform methods, and/or operations of methods, of the disclosure. For example, MAC 1022 (or, components thereof), and/or HR mapper 1024 (or, components thereof) can perform methods, and/or operations of methods, of the disclosure to determine mapping decisions to map APP 1002 to CGR hardware of a CGRS for execution by the CGRS.

In implementations memory 1030 can comprise one or more memories, and/or storage media, and can store instructions, inputs, and/or results of operations of compiler 1020, MAC 1022, and/or HR mapper 1024. For example compiler 1020, MAC 1022, and/or HR mapper 1024 can execute on processor 1004 and can access (e.g., read or write) data in memory 1030 via an interface coupling processor 1004 and memory 1030, shown in FIG. 10 as interface 1010. Interface 1010 can comprise any interface suitable for processor 1004 and/or programs of compiler 1020, MAC 1022, and/or HR mapper 1024 to access memory 1030. Interface 1010 can comprise, for example, processor, I/O, and/or memory buses; I/O links; and/or network and/or storage interface devices.

Memory 1030 is depicted in FIG. 10 as including decisions 1032, section proposals 1034, aux graph 1036, and IR 1038. Compiler 1020, MAC 1022, and/or HR mapper 1024 can generate mapping decisions to map APP 1002 for execution on a CGRS and can store the mapping decisions in memory 1030 as decisions 1032. Similarly, compiler 1020, MAC 1022, and/or HR mapper 1024 can generate aux graph 1036 based on graph 1008 and/or APP 1002. For example, compiler 1020, MAC 1022, and/or HR mapper 1024 can generate aux graph 1036 based on modifications to graph 1008 associated with mapping decisions determined by compiler 1020, MAC 1022, and/or HR mapper 1024.

Compiler 1020, MAC 1022, and/or HR mapper 1024 can generate section proposals, such as described in method 800 of FIG. 8A and/or method 830 of FIG. 8B, and can store the section proposals in section proposals 1034. Compiler 1020, MAC 1022, and/or HR mapper 1024 can generate IR 1038 as an intermediate representation of graph 1008 and/or APP 1002 that incorporates mapping decisions among decisions 1032 and/or section proposals among section proposals 1034. Compiler 1020, MAC 1022, and/or HR mapper 1024 can utilize HW Spec 1006 to determine mapping decisions and/or section proposals. HW Spec 1006 can specify characteristics of CGR hardware, such as hardware parameters associated with hardware heuristics previously described.

Computer Program Product

Implementations can comprise a computer program product and can include a computer readable storage medium (or media) having computer readable program instructions of the computer program product incorporated therein. It will be understood by one of ordinary skill in the art that computer readable program instructions can implement each or any combination of operations and/or structure of the disclosure, such as illustrated by the drawings and described herein.

The computer readable program instructions can be provided to one or more processors, and/or other elements, of a computing system or apparatus to produce a machine which can execute, via the processor(s), to implement operations and/or actions similar or equivalent to those of the disclosure. The computer readable program instructions can be stored in a computer readable storage medium that can direct one or more processors, and/or other elements, of a computing system or apparatus to function in a particular manner, such that the computer readable storage medium comprises an article of manufacture including instructions to implement operations and/or structures similar or equivalent to those of the disclosure.

The computer readable program instructions of the computer program product can cause one or more processors to perform operations of the disclosure. A sequence of program instructions, and/or an assembly of one or more interrelated programming modules, of the computer program product can direct one or more processors and/or computing elements of a computing system to implement the elements and/or operations of the disclosure including, but not limited to, the structures and operations illustrated and/or described in the present disclosure.

A computer readable storage medium can comprise any tangible (e.g., hardware) device, or combination of tangible devices, that can store instructions of the computer program product and that can be read by a computing element to download the instructions for use by a processor. A computer readable storage medium can comprise, but is not limited to, electronic, magnetic, optical, electromagnetic, and/or semiconductor storage devices, or any combination of these. A computer readable storage medium can comprise a portable storage medium, such as a magnetic disk/diskette, optical disk (CD or DVD); a volatile and/or non-volatile memory; a memory stick, a mechanically encoded device, and any combination of these. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as electrical signals transmitted through a wire, radio waves or other freely propagating electromagnetic waves, or electromagnetic waves propagating through a wave transmission medium (e.g., a wave guide or fiber-optic cable).

The computer readable program instructions can be communicated from the computer readable storage medium to the one or more computing/processing devices, via a programming API of a computing system, and/or a communications interface of a computing system, having access to the computer readable storage medium, and/or a programming API of a computing system, and/or a communications interface of the one or more computing/processing devices. The API(s) and/or communications interface(s) can couple communicatively and/or operatively to a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The API(s) and/or communications interface(s) can receive the computer readable program instructions read from computer readable storage medium and can forward the computer readable program instructions to the one or more computing/processing devices via the API(s), communications interface(s), and/or network.

In implementations, the computer readable program instructions of the computer program product can comprise machine language and/or assembly language instructions, instruction-set-architecture (ISA) instructions, microcode and/or firmware instructions, state-setting data, configuration data for integrated circuitry, source code, and/or object code. The instructions and/or data can be written in any combination of one or more programming languages.

The computer readable program instructions can execute entirely, or in part, on a user's computer, as a stand-alone software package; partly on a user's computer and partly on a remote computer; or, entirely on a remote computer. A remote computer can be connected to a user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN). In implementations, electronic circuitry including, for example, FPGAs, PLAs, and/or CGRPs can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to configure the electronic circuitry to perform operations or elements of the disclosure, such as illustrated by the drawings and described herein.

In implementations, computer readable program instructions can also be loaded onto a computing system, or component(s) thereof, to cause the computing system and/or component(s) thereof to perform a series of operational steps to produce a computer implemented process, such that the instructions which execute on the computing system, or component(s) thereof, implement the operations or elements of the disclosure, such as illustrated by the drawings and described herein.

The flowchart and block diagrams in the Drawings and Incorporations illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present invention. Individual elements illustrated in the Figures—such as individual operations illustrated in the flowcharts or individual blocks of block diagrams—can represent a module, segment, or portion of executable instructions for implementing the disclosed function(s). In various alternative implementations, particular operations can occur in an order differing from that illustrated in the examples of the drawings. For example, two operations shown in succession in a diagram of the disclosure may, in a particular implementation, be executed substantially concurrently, or can sometimes be executed in a reverse order, depending upon the functionality involved. It will be further noted that particular blocks of the block diagrams, operations of the flowchart illustrations, and/or combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented using special purpose hardware and/or systems that, individually or in combination, perform the specified functions, acts, and/or computer instructions.

Terminology used herein, and the examples disclosed, are chosen to illustrate the principles of the implementations, the practical application or technical improvement over alternative technologies, and to enable others of ordinary skill in the art to understand the implementations disclosed herein. The disclosure illustrates various example implementations, and the examples are intended to illustrate principles and aspects of the disclosure, but are not intended to limit implementations, nor intended to be exhaustive of implementations that can be conceived within the scope of the disclosure. It would be apparent to one of ordinary skill in the art that alternative implementations can comprise modifications and combinations within the spirit of the disclosure and the scope of the claims.

As can be seen in the foregoing examples, features of the disclosure can comprise methods and apparatuses of computing systems. A summary of example implementations of such features includes:

Example Implementation 1

A method comprises: generating, by a compiler included in a first computing system, a MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware resources of a second computing system for the second computing system to execute the dataflow application, the MI model comprising MI equations to solve by an MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; outputting, by the compiler, the MI model to the MI solver; invoking, by the compiler, the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receiving, by the compiler, the MI solution; and, generating, by the compiler, a globally optimized mapping decision based on the MI solution.

Example Implementation 2

The example of implementation 1, wherein the objective function is expressed as a computation comprising an MI linear equation.

Example Implementation 3

The example of implementation 1, wherein equations among the MI equations comprise MI decision variables and MI decision equations.

Example Implementation 4

The example of implementation 1, wherein equations among the MI equations comprise MI constraint variables and MI constraint equations to include in the MI model.

Example Implementation 5

The example of implementation 4, wherein the MI constraint equations comprise equations selected from a group consisting of: node equations, bounds equations, data dependency equations, hardware usage equations, transfer size equations, and latency equations.

Example Implementation 6

The example of implementation 1, wherein the optimization objective is selected from a group consisting of: maximizing a processing throughput to execute the dataflow application by the second computing system; maximizing a number of processors to execute the dataflow application by the second computing system; maximizing a number of parallel operations to execute the dataflow application by the second computing system; minimizing a latency to execute the dataflow application by the second computing system; minimizing an amount of memory to execute the dataflow application by the second computing system; and, minimizing a number of data transfers to execute the dataflow application by the second computing system.

Example Implementation 7

The example of implementation 1, wherein the second computing system comprises a coarse grain reconfigurable system.

Example Implementation 8

The example of implementation 1, wherein the method further comprises generating, by the compiler, a human readable representation of the globally optimized mapping decision.

Example Implementation 9

The example of implementation 1, wherein the MI solver comprises a commercially available MI solver.

Example Implementation 10

A computer program product comprises a computer readable storage medium having program instructions embodied therewith, wherein the program instructions are executable by at least one processor to cause the at least one processor to: generate an MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware resources of a computing system to execute the dataflow application, the MI model comprising MI equations to solve by an MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; output the MI model to the MI solver; invoke the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receive the MI solution; and, generate a globally optimized mapping decision based on the MI solution.

Example Implementation 11

The example of implementation 10, wherein the program instructions are executable by the at least one processor to further cause the at least one processor to generate a human readable representation of the globally optimized mapping decision.

Example Implementation 12

A first computing system comprises: a graph corresponding to a dataflow application; a hardware specification describing hardware of a second computing system for executing the dataflow application; a first processor and a second processor; an MI (Mixed Integer) Solver; and, a compiler,

wherein the compiler is configured to execute on the first processor to: generate an MI model to determine mapping decisions to map the dataflow application to hardware resources of the second computing system to execute the dataflow application, the MI model comprising MI equations to solve by the MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; output the MI model to the MI solver; invoke the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receive the MI solution; and, generate a globally optimized mapping decision based on the MI solution; and,

wherein the MI Solver is configured to execute on the second processor to: access the MI model; solve equations among the MI equations; and, output the MI solution.

Example Implementation 13

The example of implementation 12, wherein the objective function is expressed as a computation comprising an MI linear equation.

Example Implementation 14

The example of implementation 12, wherein equations among the MI equations comprise MI decision variables and MI decision equations.

Example Implementation 15

The example of implementation 12, wherein equations among the MI equations comprise MI constraint variables and MI constraint equations to include in the MI model.

Example Implementation 16

The example of implementation 15, wherein the MI constraint equations comprise equations selected from a group consisting of: node equations, bounds equations, data dependency equations, hardware usage equations, transfer size equations, and latency equations.

Example Implementation 17

The example of implementation 12, wherein the optimization objective is selected from a group consisting of: maximizing a processing throughput to execute the dataflow application by the second computing system; maximizing a number of processors to execute the dataflow application by the second computing system; maximizing a number of parallel operations to execute the dataflow application by the second computing system; minimizing a latency to execute the dataflow application by the second computing system; minimizing an amount of memory to execute the dataflow application by the second computing system; and, minimizing a number of data transfers to execute the dataflow application by the second computing system.

Example Implementation 18

The example of implementation 12, wherein the second computing system comprises a coarse grain reconfigurable system.

Example Implementation 19

The example of implementation 12, wherein the compiler is further configured to execute on the first processor to generate a human readable representation of the globally optimized mapping decision.

Example Implementation 20

The example of implementation 12, wherein the MI Solver comprises a commercially available MI Solver. 

What is claimed is:
 1. A method, the method comprising: generating, by a compiler included in a first computing system, an MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware resources of a second computing system for the second computing system to execute the dataflow application, the MI model comprising MI equations to solve by an MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; outputting, by the compiler, the MI model to the MI solver; invoking, by the compiler, the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receiving, by the compiler, the MI solution; and, generating, by the compiler, a globally optimized mapping decision based on the MI solution.
 2. The method of claim 1, wherein the objective function is expressed as a computation comprising an MI linear equation.
 3. The method of claim 1, wherein equations among the MI equations comprise MI decision variables and MI decision equations.
 4. The method of claim 1, wherein equations among the MI equations comprise MI constraint variables and MI constraint equations to include in the MI model.
 5. The method of claim 4, wherein the MI constraint equations comprise equations selected from a group consisting of: node equations, bounds equations, data dependency equations, hardware usage equations, transfer size equations, and latency equations.
 6. The method of claim 1, wherein the optimization objective is selected from a group consisting of: maximizing a processing throughput to execute the dataflow application by the second computing system; maximizing a number of processors to execute the dataflow application by the second computing system; maximizing a number of parallel operations to execute the dataflow application by the second computing system; minimizing a latency to execute the dataflow application by the second computing system; minimizing an amount of memory to execute the dataflow application by the second computing system; and, minimizing a number of data transfers to execute the dataflow application by the second computing system.
 7. The method of claim 1, wherein the second computing system comprises a coarse grain reconfigurable system.
 8. The method of claim 1, wherein the method further comprises generating, by the compiler, a human readable representation of the globally optimized mapping decision.
 9. The method of claim 1, wherein the MI solver comprises a commercially available MI solver.
 10. A computer program product, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the program instructions are executable by at least one processor of a first computing system to cause the at least one processor to: generate an MI (mixed integer) model to determine mapping decisions to map a dataflow application to hardware resources of a computing system to execute the dataflow application, the MI model comprising MI equations to solve by an MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; output the MI model to the MI solver; invoke the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receive the MI solution; and, generate a globally optimized mapping decision based on the MI solution.
 11. The computer program product of claim 10, wherein the program instructions are executable by the at least one processor to further cause the at least one processor to generate a human readable representation of the globally optimized mapping decision.
 12. A first computing system comprising: a graph corresponding to a dataflow application; a hardware specification describing hardware of a second computing system for executing the dataflow application; a first processor and a second processor; an MI (Mixed Integer) Solver; and, a compiler, wherein the compiler is configured to execute on the first processor to: generate an MI model to determine mapping decisions to map the dataflow application to hardware resources of the second computing system to execute the dataflow application, the MI model comprising MI equations to solve by the MI solver, the MI equations including equations of an objective function corresponding to an optimization objective; output the MI model to the MI solver; invoke the MI solver to compute an MI solution comprising solutions to equations among the equations included in the MI model; receive the MI solution; and, generate a globally optimized mapping decision based on the MI solution; and, wherein the MI Solver is configured to execute on the second processor to: access the MI model; solve equations among the MI equations; and, output the MI solution.
 13. The first computing system of claim 12, wherein the objective function is expressed as a computation comprising an MI linear equation.
 14. The first computing system of claim 12, wherein equations among the MI equations comprise MI decision variables and MI decision equations.
 15. The first computing system of claim 12, wherein equations among the MI equations comprise MI constraint variables and MI constraint equations to include in the MI model.
 16. The first computing system of claim 15, wherein the MI constraint equations comprise equations selected from a group consisting of: node equations, bounds equations, data dependency equations, hardware usage equations, transfer size equations, and latency equations.
 17. The first computing system of claim 12, wherein the optimization objective is selected from a group consisting of: maximizing a processing throughput to execute the dataflow application by the second computing system; maximizing a number of processors to execute the dataflow application by the second computing system; maximizing a number of parallel operations to execute the dataflow application by the second computing system; minimizing a latency to execute the dataflow application by the second computing system; minimizing an amount of memory to execute the dataflow application by the second computing system; and, minimizing a number of data transfers to execute the dataflow application by the second computing system.
 18. The first computing system of claim 12, wherein the second computing system comprises a coarse grain reconfigurable system.
 19. The first computing system of claim 12, wherein the compiler is further configured to execute on the first processor to generate a human readable representation of the globally optimized mapping decision.
 20. The first computing system of claim 12, wherein the MI Solver comprises a commercially available MI Solver. 